Universites Paris VI & VII and VU University

We consider full Bayesian inference in the multivariate normal
mean model in the situation that the mean vector is sparse. The prior
distribution on the vector of means is constructed hierarchically by
first choosing a collection of nonzero means and next a prior on the
nonzero values. We consider the posterior distribution in the frequentist
set-up that the observations are generated according to a fixed
mean vector, and are interested in the posterior distribution of the
number of nonzero components and the contraction of the posterior
distribution to the true mean vector. We find various combinations
of priors on the number of nonzero coefficients and on these coefficients
that give desirable performance. We also find priors that give
suboptimal convergence, for instance, Gaussian priors on the nonzero
coefficients. We illustrate the results by simulations.

1. Introduction. Suppose that we observe a vector $X = (X_1, \ldots, X_n)$ in
$\mathbb{R}^n$ such that

(1.1)  $X_i = \theta_i + \varepsilon_i, \qquad i = 1, \ldots, n,$

for independent standard normal random variables $\varepsilon_i$ and an unknown
vector of means $\theta = (\theta_1, \ldots, \theta_n)$. We are interested in Bayesian
inference on $\theta$, in the situation that this vector is possibly sparse.

Non-Bayesian approaches to this problem have recently been considered
by many authors. Golubev [13] obtained results for model selection methods
and threshold estimators for the mean-squared risk. Birgé and Massart [4]
treated the model within their general context of model selection by
penalized least squares. Abramovich et al. [1] studied the performance of the
False Discovery Rate method. The earlier work by Donoho and Johnstone [10]
can be viewed as studying the problem within an $\ell_r$ context. Many authors
(see, e.g., [3, 21, 22] and references cited there) have investigated the
connection to the LASSO or similar methods.

Methods with a Bayesian connection were studied by George and Foster [12],
Zhang [20], Johnstone and Silverman [16, 17], Abramovich, Grinshtein
and Pensky [2] and Jiang and Zhang [15]. George and Foster [12] and
Johnstone and Silverman [16] considered an empirical Bayes method,
consisting of modeling the parameters $\theta_1, \ldots, \theta_n$ a priori as independently
drawn from a mixture of a Dirac measure at 0 and a continuous distribution,
determining an appropriate mixing weight by the method of (restricted)
marginal maximum likelihood and finally employing the posterior median or
mean. The second paper [2] motivated penalties, applied in a penalized
minimum contrast scheme, by prior distributions on the parameters, and
derived estimators for the number of nonzero $\theta_i$ and for $\theta$ itself. The
first is a posterior mode, but the estimator for $\theta$, called "Bayesian
testimation," does not seem itself Bayesian. (In fact, the Gaussian prior
for the nonzero parameters in [2] will be seen to perform suboptimally in
our fully Bayesian set-up.) Zhang [20] and Jiang and Zhang [15] obtain sharp
results on (nonparametric) empirical Bayes estimators.

Other related papers include [5-7, 14, 15, 19]. 

A penalized minimum contrast estimator can often be viewed as the mode
of the posterior distribution, and it is helpful to interpret penalties
accordingly. However, the Bayesian approach yields a full posterior
distribution, which is a random probability distribution on the parameter
space. It has both a location and a spread, and can be marginalized to give
posterior distributions for any functions of the parameter vector of
interest. It is this object that we study in this paper. Such full Bayesian
inference was recently considered by Scott and Berger [18], who discussed
various aspects not covered in the present paper, but no concentration
results. One example of our results is that the beta-binomial priors in [18],
combined with moderately to heavy tailed priors on the nonzero means, yield
optimal recovery.

Sparsity can be defined in various ways. Perhaps the most natural
definition is the class of nearly black vectors, defined as

$\ell_0[p_n] = \{\theta \in \mathbb{R}^n : \#(1 \le i \le n : \theta_i \ne 0) \le p_n\}.$

Here $p_n$ is a given number, which in theoretical investigations is typically
assumed to be $o(n)$, as $n \to \infty$. Sparsity may also mean that many means
are small, but possibly not exactly zero. Definitions that make this precise
use strong or weak $\ell_s$-balls, typically for $s \in (0,2)$. These are defined
as, with $|\theta_{[1]}| \ge |\theta_{[2]}| \ge \cdots \ge |\theta_{[n]}|$ the absolute values of the
coordinates of $\theta$ in nonincreasing order,

$\ell_s[p_n] = \Bigl\{\theta \in \mathbb{R}^n : \frac{1}{n} \sum_{i=1}^n |\theta_i|^s \le \Bigl(\frac{p_n}{n}\Bigr)^s\Bigr\},$

$m_s[p_n] = \Bigl\{\theta \in \mathbb{R}^n : \frac{1}{n} \max_{1 \le i \le n} i\,|\theta_{[i]}|^s \le \Bigl(\frac{p_n}{n}\Bigr)^s\Bigr\}.$

Because the nonzero coefficients in $\ell_0[p_n]$ are not quantitatively
restricted, there is no inclusion relationship between this space and the
weak and strong balls, although results for the latter can be obtained by
projecting them into $\ell_0[p_n]$. On the other hand, the inclusion
$\ell_s[p_n] \subset m_s[p_n]$ holds for any $s > 0$.

The extent of the sparsity, measured by the constant $p_n$, is assumed
unknown. Our Bayesian approach starts by putting a prior $\pi_n$ on this
number, a given probability measure on the set $\{0, 1, 2, \ldots, n\}$. Next we
complete this to a prior on the set of all possible sequences
$\theta = (\theta_1, \ldots, \theta_n)$ in $\mathbb{R}^n$: given a draw $p$ from $\pi_n$, we choose a random
subset $S \subset \{1, \ldots, n\}$ of cardinality $p$, choose the corresponding
coordinates $(\theta_i : i \in S)$ from a density $g_S$ on $\mathbb{R}^S$, and set the remaining
coordinates $(\theta_i : i \in S^c)$ equal to zero. Given this prior, Bayes's rule
yields the posterior distribution of $\theta$, as usual. We investigate the
properties of this posterior distribution, in its dependence on the priors
on the dimension and on the nonzero coefficients, in the non-Bayesian set-up
where $X$ follows (1.1) with $\theta$ equal to a fixed, "true" parameter $\theta_0$.

If the true parameter vector $\theta_0$ belongs to $\ell_0[p_n]$, then it is desirable
that the posterior distribution concentrates most of its mass on nearly black
vectors. One main result of the paper is that this is the case provided the
prior probabilities $\pi_n(p)$ decrease exponentially fast with the dimension $p$.

The quality of the reconstruction of the full vector $\theta$ can be measured by
various distances. A natural one is the Euclidean distance, with square

$\|\theta - \theta'\|^2 = \sum_{i=1}^n (\theta_i - \theta_i')^2.$

If the indices of the $p_n$ nonzero coordinates of a vector in the model
$\ell_0[p_n]$ were known a priori, then the vector could be estimated with mean
square error of the order $p_n$. In [11] it is shown that, as $n, p_n \to \infty$ with
$p_n = o(n)$,

$\inf_{\hat\theta} \sup_{\theta \in \ell_0[p_n]} P_{n,\theta} \|\hat\theta - \theta\|^2 = 2 p_n \log(n/p_n)\,(1 + o(1)).$

Here the infimum is taken over all estimators $\hat\theta = \hat\theta(X)$, and $P_{n,\theta}$
denotes taking the expectation under the assumption that $X$ is
$N_n(\theta, I)$-distributed. In other words, the square minimax rate over
$\ell_0[p_n]$ is $p_n \log(n/p_n)$, meaning that the unknown identity of the nonzero
means needs to lead only to a logarithmic loss.

The Bayesian approach is presumably adopted for the intuition provided
by prior modeling, and is not necessarily directed at attaining minimax rates.
However, for theoretical investigation, it is natural to take the minimax rate
as a benchmark, and it is of particular interest to know which priors yield
a posterior distribution that concentrates most of its mass on balls around
$\theta_0$ of square radius of order $p_n \log(n/p_n)$, or close relatives such as
$p_n (\log n)^r$ that lose (only) a logarithmic factor. A second main result of
the paper is that the minimax rate is attained for many combinations of
priors. It suffices that the priors $\pi_n$ decrease exponentially with
dimension, and give sufficient weight to the true level of sparsity: for
some $c > 0$,

(1.2)  $\pi_n(p_n) \ge \exp(-c\,p_n \log(n/p_n)).$

Furthermore, the priors on the nonzero coordinates should have tails that
are not lighter than Laplace, and satisfy a number of other technical
properties. If inequality (1.2) fails, then the rate of contraction may be
slower than minimax; we show that it is not slower than $\log(1/\pi_n(p_n))$.
[The word "contraction" is in line with other literature on nonparametric
Bayesian procedures; with the present choice of metrics (which grow with $n$)
the rates actually increase to infinity.]

More generally, we consider reconstruction relative to the $\ell^q$ metric for
$0 < q \le 2$, defined (without $q$th root) by

(1.3)  $d_q(\theta, \theta') = \sum_{i=1}^n |\theta_i - \theta_i'|^q.$

For $q < 2$ this "metric" is more sensitive to small variations in the
coordinates than the square Euclidean metric, which is $d_2$. (For $q \le 1$ the
definition gives a true metric $d_q$; for $1 < q \le 2$ it does not.) From [11] the
minimax rate over $\ell_0[p_n]$ for $d_q$ is known to be of the order

(1.4)  $r^{*q}_{n,q} = p_n \log^{q/2}(n/p_n).$

We show that the posterior "contraction" rate attains this order under
conditions as in the preceding paragraph, and more generally characterize the
rate in terms of $\log(1/\pi_n(p_n))$.
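
To make the loss (1.3) and the benchmark (1.4) concrete, the following minimal sketch evaluates both numerically. The function names and the example vectors are illustrative assumptions, not part of the paper, and (1.4) is only an order of magnitude, so the numbers are meaningful up to constants.

```python
import math


def d_q(theta, theta_prime, q):
    """d_q(theta, theta') = sum_i |theta_i - theta'_i|^q, as in (1.3) (no q-th root)."""
    if not 0 < q <= 2:
        raise ValueError("q must lie in (0, 2]")
    return sum(abs(a - b) ** q for a, b in zip(theta, theta_prime))


def minimax_rate_q(n, p_n, q):
    """Order of the minimax rate in (1.4): p_n * log(n / p_n)^(q / 2)."""
    return p_n * math.log(n / p_n) ** (q / 2)


# Hypothetical example: a sparse "truth" and a nearby reconstruction.
theta0 = [3.0, -2.0, 0.0, 0.0]
theta = [2.5, -2.0, 0.1, 0.0]

for q in (0.5, 1.0, 2.0):
    # Small coordinate errors weigh more heavily for small q.
    print(q, d_q(theta, theta0, q), minimax_rate_q(10_000, 50, q))
```

For q = 2 the second function returns the square Euclidean benchmark p_n log(n/p_n) without the leading constant 2.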

Besides nearly black vectors, we consider rates of contraction if $\theta_0$ is in a
weak $\ell_s$-ball. The minimax rate over $m_s[p_n]$ relative to $d_q$ is (see [10])

(1.5)  $\mu^{*q}_{n,s,q} = n \Bigl(\frac{p_n}{n}\Bigr)^s \log^{(q-s)/2}(n/p_n).$

This is shown to be also the rate of posterior contraction under slightly
stronger conditions on the priors than before: the prior on dimension must
decrease slightly faster than exponential. Under the same conditions we
also show that the posterior distribution has exponential concentration, and
therefore contracts also in the stronger sense of (any, Euclidean) moments.

A summary of these results is that good priors for the dimension decrease
at exponential or, perhaps better, slightly faster rate, and good priors on
the nonzero means have tails that are heavier than Laplace. We also show
that priors with lighter tails, such as the Gaussian, attain significantly lower
contraction rates at true parameter vectors $\theta_0$ that are not close to the
origin.

The structure of the article is as follows. In Section 2 we state the main 
concentration results. A practical algorithm, simulations and some pictures 
are presented in Section 3. Proofs are gathered at the end of the paper and 
in the supplementary Appendix [9] . 

1.1. Notation. We denote by $a \wedge b$ and $a \vee b$ the minimum and maximum
of two real numbers $a, b$, and write $a \lesssim b$ if $a \le Cb$ for a universal
constant $C$. The notation $:=$ means "equal by definition to." We call support
of a vector $\theta = (\theta_1, \ldots, \theta_n) \in \mathbb{R}^n$ the set of indices of nonzero
coordinates, and denote this by $S_\theta = \{i \in \{1, \ldots, n\} : \theta_i \ne 0\}$. We set
$\theta_S = (\theta_i : i \in S)$, and let $|S|$ be the cardinality of a set
$S \subset \{1, \ldots, n\}$.

2. Main results. Throughout the paper we consider a prior $\Pi_n$ on $\mathbb{R}^n$
constructed in three steps:

(P1) A dimension $p$ is chosen according to a prior probability measure
$\pi_n$ on the set $\{0, 1, 2, \ldots, n\}$.

(P2) Given $p$ a subset $S \subset \{1, \ldots, n\}$ of size $|S| = p$ is chosen uniformly
at random from the $\binom{n}{p}$ subsets of size $p$.

(P3) Given $(p, S)$ a vector $\theta_S = (\theta_i : i \in S)$ is chosen from a probability
distribution with Lebesgue density $g_S$ on $\mathbb{R}^S$ (if $p \ge 1$), and this is
extended to $\theta \in \mathbb{R}^n$ by setting the remaining coordinates $\theta_{S^c}$ equal to 0.

For simplicity we use the same density $g_S$ for every set of a given dimension
$|S|$, and will denote this also by $g_{|S|}$. We also assume that the prior on
dimension is positive, that is, $\pi_n(p) > 0$ for any integer $p \in \{0, 1, \ldots, n\}$.
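
The three steps (P1)-(P3) translate directly into a sampler. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the geometric-type dimension weights (proportional to 2^(-p)), the standard Laplace density for the nonzero coordinates, and all function names are choices made here for concreteness.

```python
import random


def sample_theta(n, dim_weights, draw_nonzero, rng):
    """Draw theta in R^n from the hierarchical prior (P1)-(P3)."""
    # (P1) dimension p with probability proportional to dim_weights[p]
    p = rng.choices(range(n + 1), weights=dim_weights, k=1)[0]
    # (P2) support S: a uniformly random subset of size p
    support = sorted(rng.sample(range(n), p))
    # (P3) independent draws on S (a product density g_S), zeros elsewhere
    theta = [0.0] * n
    for i in support:
        theta[i] = draw_nonzero(rng)
    return theta, support


def laplace_draw(rng):
    """Standard Laplace variable as a difference of two exponentials (assumed slab)."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def observe(theta, rng):
    """Data from model (1.1): X_i = theta_i + standard normal noise."""
    return [t + rng.gauss(0.0, 1.0) for t in theta]


rng = random.Random(0)
n = 50
dim_weights = [2.0 ** (-p) for p in range(n + 1)]  # assumed prior on dimension
theta, support = sample_theta(n, dim_weights, laplace_draw, rng)
x = observe(theta, rng)
print(len(support), support)
```

Any positive weight vector of length n + 1 can be substituted for `dim_weights`, and any zero-argument sampler for the nonzero coordinates.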

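Given the prior $\Pi_n$, Bayes's rule yields the posterior distribution
$B \mapsto \Pi_n(B \mid X)$, the conditional distribution of $\theta$ given $X$ if the
conditional distribution of $X$ given $\theta$ is taken equal to the normal
distribution $N_n(\theta, I)$. The probability $\Pi_n(B \mid X)$ of a Borel set
$B \subset \mathbb{R}^n$ under the posterior distribution can be written

(2.1)  $\Pi_n(B \mid X) = \dfrac{\sum_{p=0}^n \pi_n(p) \binom{n}{p}^{-1} \sum_{|S|=p} \int_{(\theta_S, 0) \in B} \prod_{i \in S} \phi(X_i - \theta_i) \prod_{i \notin S} \phi(X_i)\, g_S(\theta_S)\, d\theta_S}{\sum_{p=0}^n \pi_n(p) \binom{n}{p}^{-1} \sum_{|S|=p} \int \prod_{i \in S} \phi(X_i - \theta_i) \prod_{i \notin S} \phi(X_i)\, g_S(\theta_S)\, d\theta_S}.$

Here $\phi$ denotes the standard normal density, and $(\theta_S, 0)$ is the vector
in $\mathbb{R}^n$ formed by adding coordinates $\theta_i = 0$ to $\theta_S = (\theta_i : i \in S)$, at
the positions left open by $S \subset \{1, \ldots, n\}$ (in the correct order of the
coordinates and not at the end, as the notation suggests). This expression
is somewhat unwieldy; we consider computation in Section 3.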

The posterior distribution is a random probability distribution on $\mathbb{R}^n$,
which we study under the assumption that the vector $X = (X_1, \ldots, X_n)$ is
distributed according to a multivariate normal distribution with mean vector
$\theta_0$ and covariance matrix the identity. We let $P_{n,\theta_0} T$ denote the expected
value of a function $T = T(X)$ under this distribution.

We shall be interested in two aspects of the posterior distribution: its
dimensionality and its ability to recover the mean vector $\theta$. Because the
conditions are simpler in the case that the nonzero coordinates are
independent under the prior, in the first two results we assume that the
densities $g_S$ in (P3) are of product form. Concrete examples of priors as in
(P1) and (P3) that satisfy the conditions imposed in the theorems are given
in Section 2.5.

2.1. Dimensionality. In the context of $\ell_0[p_n]$-classes, we say that the
prior $\pi_n$ on dimension has exponential decrease if, for some constants $C > 0$
and $D < 1$,

(2.2)  $\pi_n(p) \le D\,\pi_n(p - 1), \qquad p > C p_n.$

If the condition is also satisfied with $C = 0$, we say that the prior on
dimension has strict exponential decrease.
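
Condition (2.2) is easy to check numerically for a candidate prior on dimension. The following sketch is illustrative only: the helper name and the two example priors (geometric-type weights, and a binomial prior with success probability p_n/n) are assumptions made here. Because (2.2) compares consecutive prior masses, unnormalised weights suffice.

```python
import math


def has_exponential_decrease(weights, p_n, C, D):
    """Check weights[p] <= D * weights[p - 1] for all integers p > C * p_n, cf. (2.2).

    `weights` may be unnormalised: the condition is invariant under rescaling.
    """
    n = len(weights) - 1
    return all(
        weights[p] <= D * weights[p - 1]
        for p in range(1, n + 1)
        if p > C * p_n
    )


n, p_n = 100, 5

# Geometric-type weights: strict exponential decrease (C = 0) with D = 1/2.
geometric = [2.0 ** (-p) for p in range(n + 1)]
print(has_exponential_decrease(geometric, p_n, C=0, D=0.5))

# Binomial(n, p_n / n) weights: decrease only beyond a multiple of p_n.
alpha = p_n / n
binomial = [math.comb(n, p) * alpha**p * (1 - alpha) ** (n - p) for p in range(n + 1)]
print(has_exponential_decrease(binomial, p_n, C=2, D=0.5))
print(has_exponential_decrease(binomial, p_n, C=0, D=0.5))
```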

Theorem 2.1 (Dimension). If $\pi_n$ has exponential decrease (2.2) and $g_S$
is a product of $|S|$ copies of a univariate density $g$, with mean zero and
finite second moment, then there exists $M > 0$ such that, as $p_n, n \to \infty$,

$\sup_{\theta_0 \in \ell_0[p_n]} P_{n,\theta_0} \Pi_n(\theta : |S_\theta| > M p_n \mid X) \to 0.$

For reasonable priors, we may hope that the posterior distribution spreads
mass in the $p_n$-dimensional subspace that supports a true mean vector
$\theta_0 \in \ell_0[p_n]$. The theorem shows that the posterior distribution "overshoots"
this space by subspaces of dimension at most a multiple of $p_n$. Because the
overshoot can have a random direction, this does not mean that the posterior
distribution concentrates overall on a fixed $Mp_n$-dimensional subspace.
The theorem shows that it concentrates along $Mp_n$-dimensional coordinate
planes, but its support will be far from convex.

Obviously the posterior distribution will concentrate on low-dimensional
subspaces if the higher-dimensional spaces receive little mass under the prior
$\pi_n$. By the theorem, exponential decrease is sufficient. The next step is to
show that exponential decrease is not too harsh: it is compatible with good
reconstruction of the full mean vector $\theta$. This then, of course, requires a
lower bound on the prior mass given to the spaces of "correct" dimension;
for instance, see (1.2).

2.2. Recovery. Good recovery requires also appropriate prior densities
$g_S$ on the nonzero coordinates. Because the statistical problem of recovering
$\theta$ from a $N_n(\theta, I)$ distributed observation is equivariant in $\theta$, we may hope
that the location of the nonzero coordinates of $\theta_0$ does not play a role in its
recovery rate. The non-Bayesian procedures considered in, for instance, [13]
indeed fulfill this expectation. However, a Bayesian procedure (with proper
priors) necessarily favors certain regions of the parameter space. Depending
on the choice of priors $g_S$ in (P3), this may lead to a shrinkage effect, even
in the "average" recovery of the parameter as $n \to \infty$, yielding suboptimal
behavior for true parameters $\theta_0$ that are far from the origin. This shrinkage
effect can be prevented by choosing priors $g_S$ with sufficiently heavy tails.

Again we first consider the case of independent coordinates. In the
following theorem we assume that $g_S$ is a product of $|S|$ densities of the form
$e^h$, for a function $h : \mathbb{R} \to \mathbb{R}$ satisfying

(2.3)  $|h(x) - h(y)| \lesssim 1 + |x - y| \qquad \forall x, y \in \mathbb{R}.$

This covers all densities $e^h$ with a uniformly Lipschitz function $h$, such as
the Laplace and Student densities. (For the Student density the following
theorem assumes more than 2 degrees of freedom to ensure also finiteness of
the second moment.) It also covers other smooth densities with polynomial
tails, and densities of the form $c_\alpha e^{-|x|^\alpha}$ for some $\alpha \in (0,1]$, which have a
function $h$ that is bounded in a neighborhood of the origin and uniformly
Lipschitz outside the neighborhood. On the other hand the standard normal
density is ruled out. In Theorem 2.8 we shall see that this indeed causes a
shrinkage effect.

Recall definition (1.3) of the (square) distance $d_q$.

Theorem 2.2 (Recovery). If $\pi_n$ has exponential decrease (2.2) and $g_S$
is a product of $|S|$ univariate densities of the form $e^h$ with mean zero and
finite second moment and $h$ satisfying (2.3), then for any $q \in (0,2]$, for $r_n$
satisfying

(2.4)  $r_n^2 \ge \{p_n \log(n/p_n)\} \vee \log\dfrac{1}{\pi_n(p_n)}$

and sufficiently large $M$, as $p_n, n \to \infty$ such that $p_n/n \to 0$,

$\sup_{\theta_0 \in \ell_0[p_n]} P_{n,\theta_0} \Pi_n(\theta : d_q(\theta, \theta_0) > M r_n^q\, p_n^{1-q/2} \mid X) \to 0.$

For $q = 2$ the theorem refers to the square Euclidean distance $d_2$, and
asserts that the posterior distribution contracts at the rate $r_n^2$, uniformly
over $\ell_0[p_n]$. The first inequality in (2.4) says that this rate is (of course)
not faster than the minimax rate $r^{*2}_{n,2} = p_n \log(n/p_n)$. The second shows that
it is also limited by the amount of prior mass $\pi_n(p_n)$ put on the true
dimension. If this satisfies (1.2), then $\log(1/\pi_n(p_n)) \lesssim r^{*2}_{n,2}$ and the rate
$r_n^2$ is equal to the minimax rate.
Condition (1.2) for every $p_n$ leaves a free margin of a $\log(n/p_n)$-term
over just exponential decrease of the prior $\pi_n$. If the decrease is still faster
than (1.2), then the rate of contraction may be slower. For instance, for
$\pi_n(p) \asymp \exp(-p^\alpha)$, for some $\alpha > 1$, the rate for the square Euclidean
distance given by the theorem is not better than $p_n^\alpha$, which is much bigger
than $r^{*2}_{n,2}$. In contrast, for $\alpha = 1$ the theorem gives the minimax rate.
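
The trade-off in (2.4) between the minimax term and the prior-mass term can be illustrated numerically. The sketch below assumes the dimension prior is exactly proportional to exp(-p^alpha) on {0, ..., n}; the helper names and the values of n and p_n are hypothetical, and the output is only indicative of orders of magnitude.

```python
import math


def log_inv_prior_mass(p_n, n, alpha):
    """log(1 / pi_n(p_n)) for pi_n(p) proportional to exp(-p**alpha) on {0, ..., n}."""
    log_norm = math.log(sum(math.exp(-(p**alpha)) for p in range(n + 1)))
    return p_n**alpha + log_norm


def squared_rate_bound(p_n, n, alpha):
    """Right-hand side of (2.4), together with its minimax part p_n log(n / p_n)."""
    minimax = p_n * math.log(n / p_n)
    return max(minimax, log_inv_prior_mass(p_n, n, alpha)), minimax


n, p_n = 10_000, 50
for alpha in (1.0, 2.0):
    bound, minimax = squared_rate_bound(p_n, n, alpha)
    # alpha = 1: the minimax term dominates; alpha = 2: the prior-mass term does.
    print(alpha, bound, minimax)
```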

For $q \in (0, 2)$ we can make similar remarks. The minimax rate $r^{*q}_{n,q}$ over
$\ell_0[p_n]$ for $d_q$ is given in (1.4). Because

$(r^{*}_{n,2})^q\, p_n^{1-q/2} = r^{*q}_{n,q},$

the theorem shows contraction of the posterior distribution relative to $d_q$ at
the minimax rate $r^{*q}_{n,q}$ over $\ell_0[p_n]$ under the same conditions that it gives
the minimax rate $r^{*2}_{n,2}$ for $d_2$: (1.2) suffices. Furthermore, if there is less
prior mass at $p_n$, then the rate of contraction will be slower.

In the case that $0 < q < 1$ the result is surprising at first when compared
to the finding in [16] that the posterior median, or more generally so-called
"strict-thresholding rules," attain the convergence rate $r^{*q}_{n,q}$, but the
posterior mean converges at a strictly slower rate (even when $\theta_0 = 0$; see
Section 10 in [16] and the remark below). By the preceding theorem the full
posterior distribution does contract at the optimal rate $r^{*q}_{n,q}$, for any
$0 < q \le 2$. This is true in particular for the case of binomial priors on
dimension considered in [16] with the "best possible" (oracle) choice
$\alpha_n = p_n/n$.

The slower convergence of the posterior mean relative to the contraction
of the full posterior distribution is made possible by the fact that $d_q$-balls
have astroid-type shapes for $0 < q < 1$, and differ significantly from their
convex hull if $n$ is large. The posterior mean, which is in the convex hull
of the support of the posterior, can therefore be significantly farther in
$d_q$-distance from $\theta_0$ than the bulk of the distribution. By Theorem 2.1 only
few coordinates outside the support of $\theta_0$ are given nonzero values by the
posterior. However, the corresponding indices are random and on average
spread over $\{1, 2, \ldots, n\}$, so that the posterior mean at a fixed
coordinate is typically nonzero. Adding up all small errors in $\ell^q$ typically
gives a much higher total sum for $q < 1$ than for $q > 1$. In contrast the
posterior median does not suffer from this averaging effect.

The posterior measure thus provides a unifying point of view on the
considered objects. In this perspective for $0 < q < 1$ the posterior mean is a
bad representation of the full posterior measure.

Remark 2.3. From the arguments exposed in [16], it is not hard to
check that the posterior mean generally fails to attain the minimax rate over
$\ell_0[p_n]$ relative to $d_q$ for $0 < q < 1$. Let us consider the case of $\ell_0[p_n]$
classes with $\theta_0 = 0$. With the notation of [16], the posterior mean
$\tilde\mu(x, \alpha_n)$ with data $X_1 = x$ for the binomial prior on dimension with
parameters $(n, \alpha_n)$ satisfies $|\tilde\mu(x, \alpha_n)| \ge C|x|\alpha_n$, by the same reasoning as
in the last display of page 1647 in [16] (the weight parameter $w$ is fixed here
and equals $\alpha_n$). Hence the $\ell^q$-power loss
$\sum_{i=1}^n P_{n,\theta_0} |\theta_{0,i} - \tilde\mu(X_i, \alpha_n)|^q$ when $\theta_0 = 0$ is bounded
from below by a constant times $n\alpha_n^q$. Thus, even for the "oracle" parameter
$\alpha_n = p_n/n$, this is much above the minimax risk for any $0 < q < 1$.

2.3. Dependent priors. The preceding theorems are also true for priors
that render the coordinates $\theta_i$ dependent. In the remaining theorems we
assume that the densities $g_S$ in (P3) satisfy the conditions, for every
$S' \subset S \subset \{1, \ldots, n\}$ and a universal constant $c_1$,

(2.5)  $\log g_S(\theta) - \log g_S(\theta') \le c_1|S| + \tfrac{1}{64}\|\theta - \theta'\|^2 \qquad \forall \theta, \theta' \in \mathbb{R}^S,$

(2.6)  $|\log g_S(\theta) - \log g_{S'}(\pi_{S'}\theta)| \le c_1|S| + \tfrac{1}{64}\|\pi_{S \setminus S'}\theta\|^2 \qquad \forall \theta \in \mathbb{R}^S.$

Here $\pi_S : \mathbb{R}^n \to \mathbb{R}^S$ is the projection defined by $\pi_S\theta = \theta_S = (\theta_i : i \in S)$.
(The constant 64 corresponds to the constant 32 in Lemma 5.1, but has no
special significance and can be improved.)

For a partition $S = S_1 \cup S_2$, we denote by $\theta = (\theta_1, \theta_2)$ the corresponding
partition of $\theta \in \mathbb{R}^S$ and by $g_{S_1,S_2}(\theta_1, \theta_2) = g_S(\theta)$ the corresponding
density. In the next theorem we assume that there exist $C, m_1 > 0$ and, for
any $S_2$, probability densities $\gamma_{S_2}$ on $\mathbb{R}^{S_2}$, such that for any
$\theta_2 \in \mathbb{R}^{S_2}$ and $S_1 \subset S_2^c$,

(2.7)  $\sup_{\theta_1 \in \mathbb{R}^{S_1}} \dfrac{g_{S_1,S_2}(\theta_1, \theta_2)}{g_{S_1}(\theta_1)} \le C m_1^{|S_1|+|S_2|}\, \gamma_{S_2}(\theta_2).$

This condition expresses that the "mixing between the coordinates within a
given subspace" is not too important.
Examples are given in Section 2.5.

Theorem 2.4 (Recovery). Suppose $\pi_n$ has strict exponential decrease,
that is, satisfies (2.2) with $C = 0$ and some $D > 0$. The assertions of
Theorems 2.1 and 2.2 are also true if the densities $g_S$ are not product
densities, but general densities with finite second moments that satisfy
(2.5), (2.6) and (2.7) with $Dm_1 < 1$, and $m_1$ the constant in (2.7).

2.4. Complexity priors. The next results are designed for application to
the particular priors $\pi_n$ of the form, for positive constants $a, b$,

(2.8)  $\pi_n(p) \propto e^{-ap\log(bn/p)},$

where $\propto$ stands for "proportional to." Because
$e^{p\log(n/p)} \le \binom{n}{p} \le e^{p\log(ne/p)}$, this prior is inversely proportional to the
number of models of size $p$, a quantity that could be viewed as the model
complexity for a given dimension $p$. Thus this prior appears particularly
suited to the purpose of "downweighting the complexity." Forgetting about
the extra component $g_S$ of the prior, we can also consider it an analog of the
penalty "$2p\log(n/p)$" used in model selection in this context by (e.g.) Birgé
and Massart in [4]. Every particular model with support $S$ of size $|S| = p$
receives prior probability bounded below and above by expressions of the
type $e^{-a_1 p\log(b_1 n/p)}$ from this prior.
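
Both the binomial-coefficient bounds and the decrease of the complexity prior can be checked numerically. The sketch below is an illustration with assumed values n = 200, a = 1, b = 4 and hypothetical helper names; the convention 0 log(.) = 0 at p = 0 is also an assumption made here. The bound (e/b)^a on consecutive ratios follows from elementary calculus for b > e and is not a statement taken from the paper.

```python
import math


def log_binom_bounds(n, p):
    """Return (p log(n/p), log C(n, p), p log(ne/p)) for 1 <= p <= n."""
    lower = p * math.log(n / p)
    upper = p * math.log(n * math.e / p)
    return lower, math.log(math.comb(n, p)), upper


def complexity_prior(n, a, b):
    """Normalised pi_n(p) proportional to exp(-a p log(b n / p)), p = 0, ..., n."""
    log_w = [0.0] + [-a * p * math.log(b * n / p) for p in range(1, n + 1)]
    total = sum(math.exp(v) for v in log_w)
    return [math.exp(v) / total for v in log_w]


n, a, b = 200, 1.0, 4.0
pi = complexity_prior(n, a, b)

# Largest ratio pi(p) / pi(p - 1); for b > e it stays below (e / b)^a < 1.
max_ratio = max(pi[p] / pi[p - 1] for p in range(1, n + 1))
print(max_ratio, (math.e / b) ** a)
print(log_binom_bounds(n, 10))
```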

Because the complexity prior (2.8) has exponential decrease (2.2) when
$b > 1 + e$ and satisfies (1.2), Theorems 2.1 and 2.4 (or Theorem 2.2) show that
the corresponding posterior distribution concentrates on low-dimensional
spaces and attains the minimax rate of contraction over $\ell_0[p_n]$ relative to
(any) $d_q$ if combined with densities $g_S$ satisfying the conditions of
Theorem 2.4. The following theorem relaxes the condition on $g_S$ and gives a
more precise result on the contraction of the posterior measure.

The theorem applies more generally to priors on dimension satisfying the
upper bound, for some $a, b > 0$, and every $p \in \{0, 1, \ldots, n\}$,

(2.9)  $\pi_n(p) \le e^{-ap\log(bn/p)}.$

Theorem 2.5 (Recovery). If the densities $g_S$ have finite second moments,
satisfy (2.5) and (2.6) for some constant $c_1$, and the priors $\pi_n$ satisfy
(2.9) for some $a > 1$ and $b > e^{7+2c_1}$, then, for $r_n$ satisfying (2.4), for
any $1 \le p_n \le n$ and any $r \ge 1$,

$\sup_{\theta_0 \in \ell_0[p_n]} P_{n,\theta_0} \Pi_n(\theta : \|\theta - \theta_0\| > 45 r_n + 10 r \mid X) \lesssim e^{-r^2/10}.$

Consistent with the preceding findings, the posterior distribution
concentrates on Euclidean balls of radius of the order $r_n$ around $\theta_0$. In
addition the theorem shows that its "tail" is sub-Gaussian, uniformly in $n$
and uniformly over $\ell_0[p_n]$. As one consequence, for every $l \in \mathbb{N}$,

$P_{n,\theta_0} \int \|\theta - \theta_0\|^l \, d\Pi_n(\theta \mid X) \lesssim r_n^l.$

By Jensen's inequality, this in turn implies the following corollary.

Corollary 2.1 (Posterior mean). Under the conditions of Theorem 2.5,

$\forall l \in \mathbb{N}: \qquad \sup_{\theta_0 \in \ell_0[p_n]} P_{n,\theta_0} \Bigl\| \int \theta \, d\Pi_n(\theta \mid X) - \theta_0 \Bigr\|^l \lesssim r_n^l.$

The posterior mean $\int \theta \, d\Pi_n(\theta \mid X)$ as a point estimator of $\theta_0$ has a risk
of the order $r_n$, relative to every polynomial loss function. In particular, it
is rate-minimax over $\ell_0[p_n]$ for the squared $\ell_2$-risk.

The posterior coordinate-wise median considered in the simulation study
below is another interesting functional of the posterior measure. Under the
conditions of Theorem 2.5 and (2.8), the posterior coordinate-wise median
is rate-minimax over $\ell_0[p_n]$, for any $d_q$-distance, $0 < q \le 2$; see [9].
The theorem, with its explicit bound, is also the basis for results on the
concentration of the posterior distribution when the true vector is in a weak
$m_s[p_n]$-class. Results for the posterior mean and $\ell_2$-risk can be obtained as
above as a consequence.

Theorem 2.6 (Recovery, weak class). If the densities $g_S$ have finite
second moments, satisfy (2.5) and (2.6) for some constant $c_1$, and the priors
$\pi_n$ satisfy (2.9) for some $a > 1$ and $b > e^{7+2c_1}$, then, for $r_n$ satisfying

$r_n^2 = \min_{1 \le p \le n} \Bigl[ \dfrac{s}{2-s}\, n \Bigl(\dfrac{n}{p}\Bigr)^{2/s-1} \Bigl(\dfrac{p_n}{n}\Bigr)^2 \vee p\log\dfrac{n}{p} \vee \log\dfrac{1}{\pi_n(p)} \Bigr],$

for any $1 \le p_n \le n$, $s \in (0, 2)$ and any $r \ge 1$,

$\sup_{\theta_0 \in m_s[p_n]} P_{n,\theta_0} \Pi_n(\theta : \|\theta - \theta_0\| > 80 r_n + 20 r \mid X) \lesssim e^{-r^2/10}.$

For the "complexity prior" $\pi_n$ given by (2.8) the third term $\log(1/\pi_n(p))$
in the minimum defining $r_n^2$ is smaller than a multiple of the second term,
and hence can be omitted. The minimum can then be determined by equating
the first two terms, leading to

(2.10)  $p_n^* \asymp n (p_n/n)^s / \log^{s/2}(n/p_n).$

If $p_n^* \ge 1$, then this value is eligible in the minimum, and the first and
second terms evaluated at $p_n^*$ are of the same order, given by

$n \Bigl(\dfrac{p_n}{n}\Bigr)^s \log^{1-s/2}\dfrac{n}{p_n}.$

This in fact is the minimax rate $\mu^{*2}_{n,s,2}$ for the square Euclidean metric $d_2$
over the class $m_s[p_n]$; see (1.5). Thus the complexity priors combined with
densities $g_S$ satisfying (2.5) and (2.6) [in particular, product densities
satisfying (2.3)] yield contraction at the minimax rate over both the nearly
black vectors $\ell_0[p_n]$ and the weak $m_s[p_n]$ classes. For priors on dimension
that are significantly smaller than the complexity priors, the third term in
the minimum must be taken into account, and the rate of contraction is slower
than minimax.

The condition $p_n^* \ge 1$ is satisfied as soon as the sparsity coefficient $p_n/n$
is not too small. If the signal is very sparse and has $p_n^* \ll 1$, then the
minimum in the definition of $r_n^2$ is taken at $p \asymp 1$, leading to a squared rate
of the order $\log n$. This is within a constant of the rate achieved by hard
thresholding in that case.

The previous result extends under slightly stronger conditions to
$d_q$-distances with $q > s$. Furthermore, the following theorem shows that $p_n^*$
is indeed an upper bound on the dimensionality of the posterior distribution.
For simplicity we only state the result in the case of complexity priors.
Recall that $\mu^{*q}_{n,s,q}$, given in (1.5), denotes the minimax rate over the class
$m_s[p_n]$ relative to $d_q$.

Theorem 2.7 (Dimensionality, recovery, weak class). Suppose the densities
$g_S$ have finite second moments, satisfy (2.5), (2.6) and (2.7), and $\pi_n$
satisfies (2.8) for sufficiently large $a > 1$ and $b > e$. Then for any
$s \in (0, 2)$, any $q \in (s, 2)$ and any $(p_n)$ such that $p_n/n \to 0$ and $p_n^*$ given by
(2.10) is bounded away from 0, for a sufficiently large constant $M$,

$\sup_{\theta_0 \in m_s[p_n]} P_{n,\theta_0} \Pi_n(\theta : |S_\theta| > M p_n^* \mid X) \to 0,$

$\sup_{\theta_0 \in m_s[p_n]} P_{n,\theta_0} \Pi_n(\theta : d_q(\theta, \theta_0) > M \mu^{*q}_{n,s,q} \mid X) \to 0.$

2.5. Examples. In this section we discuss examples of priors on dimension
$\pi_n$ and prior densities $g_S$ on the nonzero coordinates that satisfy the
conditions of the preceding theorems.

Example 2.1 (Independent Dirac mixtures). Consider the prior on
$\theta = (\theta_1, \ldots, \theta_n) \in \mathbb{R}^n$ corresponding to sampling the coordinates $\theta_i$
independently from a mixture $(1 - \alpha)\delta_0 + \alpha g$ of a Dirac measure at 0 and a
univariate density $g$, for a given $\alpha \in (0,1)$. The coordinates of $\theta$ are then
independently zero with probability $1 - \alpha$, and hence the dimension of the
model is binomially distributed with parameters $n$ and $\alpha$. Furthermore, the
nonzero coordinates are distributed according to the product of copies of $g$.
Thus this prior fits in our set-up, with $\pi_n$ the binomial$(n, \alpha)$-distribution
and $g_S$ a product density.

For a fixed $\alpha$ the coordinates $\theta_i$ are independent, under both the prior
and the posterior distribution. Furthermore, the posterior distribution of
$\theta_i$ depends on $X_i$ only.
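
Because each coordinate can be treated separately under this prior, the posterior probability that a coordinate is nonzero given its observation x follows from Bayes's rule applied to the two mixture components. The sketch below is an illustration under assumptions made here: a standard Laplace density for g, a simple trapezoidal quadrature for the marginal density of X under the continuous component, and hypothetical function names.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def laplace_density(t):
    """Standard Laplace density, an assumed choice for g."""
    return 0.5 * math.exp(-abs(t))


def slab_marginal(x, slab, half_width=10.0, steps=4000):
    """Trapezoidal approximation of the integral of phi(x - t) * slab(t) dt."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        t = x - half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * phi(x - t) * slab(t)
    return total * h


def inclusion_probability(x, alpha, slab=laplace_density):
    """Posterior probability that theta_i is nonzero given X_i = x."""
    spike = (1.0 - alpha) * phi(x)
    slab_part = alpha * slab_marginal(x, slab)
    return slab_part / (spike + slab_part)


for x in (0.0, 1.0, 2.0, 3.0, 4.0, 6.0):
    print(x, inclusion_probability(x, alpha=0.1))
```

Observations near zero are attributed mostly to the Dirac component, while large observations are attributed almost entirely to the continuous component.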

This prior is considered in [12] and [16], in combination with a Gaussian
or a heavy tailed density $g$, respectively. In the next section we show
that Gaussian priors are deficient if the nonzero coordinates of the signal
are large. The authors of [16] propose to use the coordinatewise posterior
median (or another univariate point estimator) for estimating $\theta$, with the
weight parameter $\alpha$ set by a thresholded empirical Bayes method: the
parameter is chosen equal to the maximum likelihood estimator of $\alpha$ based
on the marginal distribution of $X$ in the Bayesian set-up (i.e., with $\theta$
integrated out but with fixed $\alpha$) subject to the constraint that the resulting
posterior median (after plugging in $\alpha$) given an observation in the interval
$[-\sqrt{2\log n}, \sqrt{2\log n}]$ is zero. The authors show that the resulting point
estimator works remarkably well, in a minimax sense, for various metrics and
sparsity classes.
A natural Bayesian approach is to put a prior on $\alpha$, which yields a mixture
of binomials as a prior $\pi_n$ on the dimension of the model. The independence
of the coordinates $\theta_i$ is then lost. We discuss this prior further in the
following example.

Example 2.2 (Binomial and beta-binomial priors). The binomial$(n, \alpha_n)$
distribution as the prior $\pi_n$ on dimension gives an expected dimension of
$n\alpha_n$. In the sparse setting a small value of $\alpha_n$ is therefore natural. If the
sparsity parameter $p_n$ were known, we could consider the choice $\alpha_n = p_n/n$;
we shall refer to the corresponding law as the oracle binomial prior.

Assume that $p_n \to \infty$ with $p_n/n \to 0$. The binomial prior has exponential
decrease (2.2) if $\alpha_n \lesssim p_n/n$. The oracle binomial prior $\alpha_n = p_n/n$ is at the
upper end of this range, and also satisfies (1.2), and thus yields the minimax
rate of contraction. The choice $\alpha_n = 1/n$ yields $\log \pi_n(p_n)$ of the order
$-p_n \log p_n$, and hence attains the minimax rate if $p_n$ is of the order $n^a$,
$a < 1$; for larger $p_n$ it may miss the minimax rate by a logarithmic factor.

A natural Bayesian strategy is to view the unknown "sparsity" parameter
$\alpha$ as a hyperparameter and put a prior on it. The classical choice
is the Beta prior, leading to the hierarchical scheme $\alpha \sim \mathrm{Beta}(\kappa, \lambda)$ and
$p \mid \alpha \sim \mathrm{binomial}(n, \alpha)$, which corresponds to the following prior on $p$:

$\pi_n(p) = \binom{n}{p} \dfrac{B(\kappa + p, \lambda + n - p)}{B(\kappa, \lambda)} \propto \dfrac{\Gamma(\kappa + p)\,\Gamma(\lambda + n - p)}{p!\,(n - p)!}.$

The mean dimension is $n\kappa/(\kappa + \lambda)$, which suggests choosing the
hyperparameters of the Beta distribution so that $\kappa/(\kappa + \lambda)$ is in the range
$(c/n, Cp_n/n)$. It is easy to verify that the prior has exponential decrease
(2.2), with $C = 1$, if $(\kappa - 1)/p_n \le D(\lambda - 1)/(n - p_n + 1) + D - 1$. This suggests
choosing small $\kappa$ and large $\lambda$, thus giving a small variance to the Beta
distribution.

For $\kappa = 1$ and $\lambda = n + 1$ we obtain
$\pi_n(p) \propto \binom{2n-p}{n}$. Then
$\pi_n(p)/\pi_n(p-1) = (n-p+1)/(2n-p+1)$, showing (strict) exponential
decrease (2.2), with $D = 1/2$. By a binomial identity the norming constant is
equal to $\binom{2n+1}{n}$, so

$$\pi_n(p) = \frac{(2n-p)(2n-p-1)\cdots(2n-p-n+1)}{(2n+1)\,2n\cdots(2n+1-n+1)}
  \ge \Bigl(1 - \frac{p+1}{n+2}\Bigr)^n.$$

For $p_n/n \to 0$, this gives $\pi_n(p_n) \ge e^{-p_n(1+o(1))}$, and hence (1.2)
is satisfied. More generally, we may choose $\kappa = 1$,
$\lambda = \kappa_1 n + 1$, which leads to
$\pi_n(p) \propto \binom{(\kappa_1+1)n-p}{\kappa_1 n}$. The priors given by
$\pi_n(p) \propto \binom{2n-p}{n}^{\kappa_1}$, for some $\kappa_1 > 0$, are a
further alternative.
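The special case $\kappa = 1$, $\lambda = n + 1$ of Example 2.2 can be checked numerically. The following is a minimal sketch; the function name `hierarchical_binomial_prior` is a hypothetical helper introduced only for this illustration. It builds the prior with weights $\binom{2n-p}{n}$ and exposes the norming constant, so that the ratio formula and the lower bound stated above can be verified for a given $n$.

```python
from math import comb


def hierarchical_binomial_prior(n):
    """Return ([pi_n(0), ..., pi_n(n)], norming constant).

    The weights are C(2n - p, n), the beta-binomial prior with
    kappa = 1 and lambda = n + 1 (hypothetical helper for illustration).
    """
    weights = [comb(2 * n - p, n) for p in range(n + 1)]
    total = sum(weights)  # hockey-stick identity: equals C(2n + 1, n)
    return [w / total for w in weights], total


n = 50
prior, total = hierarchical_binomial_prior(n)

# Successive ratios pi_n(p) / pi_n(p - 1) = (n - p + 1) / (2n - p + 1) <= 1/2.
ratios = [prior[p] / prior[p - 1] for p in range(1, n + 1)]
print("largest ratio:", max(ratios))
print("norming constant matches C(2n+1, n):", total == comb(2 * n + 1, n))
```

Exact integer arithmetic is used for the binomial coefficients, so only the final division introduces rounding.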



Example 2.3 (Poisson priors and hierarchies). The Poisson$(\alpha)$
distribution truncated to $\{0, 1, \ldots, n\}$ yields priors satisfying

$$\pi_n(p) \propto \frac{\alpha^p}{p!}
  \asymp C e^{-p\log(p/\alpha)}\,\frac{e^{p}}{\sqrt{p}}$$

for $p \to \infty$, by Stirling's approximation. The mean is approximately
$\alpha$, suggesting $\alpha$ in the range $(1, cp_n)$. As
$\pi_n(p)/\pi_n(p-1) = \alpha/p$, the prior has exponential decrease (2.2) for
$p > \alpha/D$.

If we put an exponential$(\lambda)$ hyperprior on $\alpha$, then $\pi_n$
transforms into a shifted geometric distribution (shifted by $-1$ to have
support starting at 0) with success probability $\lambda/(1+\lambda)$. A Gamma
hyperprior yields a shifted negative binomial. For fixed hyper-hyperparameters
both are of the form $e^{-Cp}$ for some constant $C$, and hence have exponential
decrease, and satisfy (1.2).
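The geometric form of the Poisson-exponential mixture follows from a one-line Gamma integral. The derivation below is a sketch that ignores the truncation of the Poisson law to $\{0, 1, \ldots, n\}$ (an assumption made here for simplicity; the truncation only changes the normalizing constant).

```latex
\int_0^\infty e^{-\alpha}\frac{\alpha^p}{p!}\,\lambda e^{-\lambda\alpha}\,d\alpha
  = \frac{\lambda}{p!}\int_0^\infty \alpha^p e^{-(1+\lambda)\alpha}\,d\alpha
  = \frac{\lambda}{p!}\cdot\frac{\Gamma(p+1)}{(1+\lambda)^{p+1}}
  = \frac{\lambda}{1+\lambda}\Bigl(\frac{1}{1+\lambda}\Bigr)^{p},
  \qquad p = 0, 1, 2, \ldots
```

This is a geometric law on $\{0, 1, \ldots\}$ with success probability $\lambda/(1+\lambda)$, that is, of the form $e^{-Cp}$ with $C = \log(1+\lambda)$.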

Example 2.4 (Complexity prior). The prior
$\pi_n(p) \propto e^{-ap\log(bn/p)}$ has exponential decrease (2.2) for
$b > 1 + e$ and satisfies (1.2). Theorems 2.5, 2.6 and 2.7 show that this prior
also gives sparsity and minimax recovery of the parameter over weak
$\ell_s$-classes. Although our results do not show the opposite assertion that
mere exponential decrease is not enough for minimaxity on weak classes (while
together with (1.2) it is enough for minimaxity over $\ell_0[p_n]$), this might
be a potential advantage of complexity priors over the binomial and
Poisson-based priors discussed previously.

Example 2.5 (Product prior). Densities $g_S$ that are products of $|S|$
copies of a univariate density with finite second moment of the form $g = e^h$
for $h : \mathbb{R} \to \mathbb{R}$ a function that satisfies (2.3), satisfy
(2.5), (2.6) and (2.7). In this sense Theorem 2.4 is a generalization of
Theorem 2.2.

To see this note that for a product density the function $g_S$ takes the
form $g_S(\theta) = \exp\{\sum_{i\in S} h(\theta_i)\}$. Hence if (2.3) holds with
proportionality constant 1, then the left-hand side of (2.5) is bounded in
absolute value by

$$\sum_{i\in S}|h(\theta_i) - h(\theta_i')| \le |S| + \|\theta - \theta'\|_1
  \le |S| + \sqrt{|S|}\,\|\theta - \theta'\|
  \le 17|S| + \frac{\|\theta - \theta'\|^2}{64}.$$

Furthermore, the left-hand side of (2.6) is bounded by

$$\sum_{i\in S-S'}|h(\theta_i)| \le |S - S'|\,|h(0)| + \sum_{i\in S-S'}(1 + |\theta_i|)
  \lesssim |S - S'| + \sum_{i\in S-S'}|\theta_i|.$$

The $L_1$-norm of $(\theta_i : i \in S - S')$ can be bounded by a linear
combination of $|S - S'|$ and the squared $L_2$-norm, as before, and hence the
whole expression is bounded by $C|S| + \|\pi_{S-S'}\theta\|^2/64$, for some
constant $C$.

Because a product density $g_S$ is a product of the marginal densities, the
validity of condition (2.7) is clear.

Example 2.6 (Weakly mixing priors). For $h : \mathbb{R} \to \mathbb{R}$ a
function satisfying (2.3) so that $e^h$ is integrable and
$G : [0, \infty) \to \mathbb{R}$ a Lipschitz function that is bounded below,
consider, for $\theta = (\theta_1, \ldots, \theta_p)$,

$$g_p(\theta) = a_p\, e^{\sum_{i=1}^p h(\theta_i) - G(\|\theta\|)},$$

where $a_p$ is the normalizing constant. An example is the prior, for $a > 0$,



In the Appendix [9] it is shown that priors of this form satisfy (2.5) and (2.6).
Furthermore, it is shown that (2.7) is also satisfied, with
$m_1 = (1 + a)/(1 - a)$ if $-h$ is the absolute value of the identity function
and the Lipschitz constant $a$ of $G$ is strictly smaller than 1 [i.e.,
$|G(s) - G(t)| \le a|s - t|$ for $a < 1$].

Thus any prior of this form combined with any prior on dimension that
decreases exponentially such that $Dm_1 = D(1 + a)/(1 - a) < 1$, for $D$ the
constant in (2.2), gives recovery at the minimax rate over $\ell_0[p_n]$, by
Theorem 2.4, and also over $\ell_s[p_n]$ if combined with a complexity prior on
dimension satisfying the conditions of Theorems 2.6 and 2.7. For instance, the
hierarchical binomial prior $\pi_n(p) \propto \binom{2n-p}{n}$ in Example 2.2
has $D = 1/2$ and hence $a < 1/3$ suffices for contraction over $\ell_0[p_n]$.

2.6. Lower bounds. Condition (2.3) [or (2.5) and (2.6)] on the priors $g_S$
for the nonzero coefficients ensures that the posterior does not shrink to
the center of the prior too much. In the next theorem we investigate the
necessity of conditions of this type. The theorem shows that product priors
with marginal densities proportional to $y \mapsto e^{-|y|^\alpha}$ for some
$\alpha > 1$ lead to a slow contraction rate for large true vectors
$\theta_0$. We formulate this in an asymptotic setting with a sequence of true
vectors, written as $\theta_0^n$, tending to infinity. We denote by $p_n$ the
number of nonzero coordinates of $\theta_0^n$.

The theorem applies in particular to the normal distribution. For this
prior a problem (only) arises if the parameter vector $\theta_0^n$ tends to
infinity faster than the optimal rate

$$\|\theta_0^n\|^2 \gg p_n\log(n/p_n).$$

The posterior then puts no mass on balls of radius a multiple of
$\|\theta_0^n\|$ around the true parameter. For "small" $\theta_0^n$ no problem
occurs, because shrinkage to the origin is desirable in that case. However, if
the true parameter satisfies $\|\theta_0^n\|^2 \lesssim p_n\log(n/p_n)$, then the
estimator that is zero, irrespective of the observations, possesses mean
square error of the order the minimax risk for the problem. Thus it is rather
poor consolation that the Bayes procedure based on Gaussian priors performs
well in this case, as it is no better than the "zero estimator." Gaussian
priors really are problematic.

Product priors with marginal density proportional to
$y \mapsto e^{-|y|^\alpha}$ give the same behavior as the Gaussian prior for
every $\alpha \ge 2$. For $\alpha \in (1, 2)$ the result is slightly more
complicated and involves the quantities

(2.11)

where $\|\cdot\|_\alpha$ denotes the usual $L_\alpha$-norm on $\mathbb{R}^n$
(i.e., $\|\theta\|_\alpha^\alpha = \sum_i |\theta_i|^\alpha$). The
numbers pg a increase to infinity as 9q tends to infinity at a sufficiently fast 

rate. For instance Pq a is of the order c°~ 1 pl/ 2 if a < 2 and 9q = c ti 6q 
for scalars c n and fixed vectors 9q. The following theorem shows that if p^ a 

increases to infinity faster than the optimal rate $(p_n\log(n/p_n))^{1/2}$,
then the posterior does not charge balls of radius a small multiple of
$\rho^n_{0,\alpha}$.

Theorem 2.8 (Heavy tails). Assume that the densities $g_S$ are products
of $|S|$ univariate densities proportional to $y \mapsto e^{-|y|^\alpha}$ and
the prior $\pi_n$ on dimension satisfies (1.2) for some $c > 0$:

(i) If $\alpha \ge 2$ and $\|\theta_0^n\|^2/(p_n\log(n/p_n)) \to \infty$, then
for sufficiently small $\eta > 0$, as $n \to \infty$,

$$P_{n,\theta_0^n}\Pi_n(\theta : \|\theta - \theta_0^n\| < \eta\|\theta_0^n\| \mid X^n) \to 0.$$

(ii) If $1 < \alpha < 2$ and
$(\rho^n_{0,\alpha})^2/(p_n\log(n/p_n)) \to \infty$, then for sufficiently
small $\eta > 0$, as $n \to \infty$,

$$P_{n,\theta_0^n}\Pi_n(\theta : \|\theta - \theta_0^n\| < \eta\rho^n_{0,\alpha} \mid X^n) \to 0.$$

Theorem 2.8 shows problematic behavior of the posterior distribution for
signals with large energies $\|\theta_0^n\|$. Instead of using fixed priors on
the coordinates, we could make them depend on the sample size, for instance,
Gaussian priors with variance $v_n \to \infty$, or uniform priors on intervals
$[-K_n, K_n]$ with $K_n \to \infty$. Such priors will push the "problematic
boundary" toward infinity, but the same reasoning as for the theorem will show
that shrinkage remains for (very) large $\theta_0^n$.

The above results show that $g_S$ needs to have heavy tails. Another
important condition, this time on the prior $\pi_n$ on the dimension $k$,
concerns the amount of mass $\pi_n(p_n)$ at the true dimension. If this
quantity is too small, then the Bayes procedure might not be optimal.

Theorem 2.9. Suppose also that the prior $\pi_n$ on dimension in (P1) is
decreasing and that there exist integers $d_{1,n} < d_{2,n}$ such that, for
some $C > 0$ and a sequence $\varepsilon_n$ such that
$n\varepsilon_n^2 \to \infty$,

$$\frac{\pi_n(d_{2,n})}{\pi_n(d_{1,n})}\binom{n}{d_{1,n}} \le e^{-Cn\varepsilon_n^2}.$$

Denoting $d_{3,n} = (d_{2,n} - d_{1,n})/2$, there exists $\theta_0^n$ in
$\ell_0[d_{3,n}]$ such that, for sufficiently small $\eta > 0$, as
$n \to \infty$,

$$P_{n,\theta_0^n}\Pi_n(\theta : \|\theta - \theta_0^n\| < \eta\sqrt{n}\,\varepsilon_n \mid X^n) \to 0.$$

Example 2.7 [Prior on dimension in $\exp(-k(\log k)^a)$, with $a > 1$]. If
$\pi_n(k) = r\exp(-k\log^a k)$, with $r$ the appropriate normalizing constant,
let us apply the preceding result with the choices $d_{1,n} = p_n/4$,
$d_{2,n} = 3p_n/4$, for some sequence $p_n \to \infty$. It holds that

7Tn(3p n /4) / Tl \ < e _(3p„/4) log a (3p„/4)+(p n /4) log a (p„/4)+(p n /4) log(ne) 

< e - (Pn /4) log" (3p„/4) - (p n /4) log" (3p„ /4) 2l/a /4) log" (ne) _ 

As long as we impose (3p n /4) 2l/a > ne and log(3p n /4) > 2~ 1 / a logp n (which 
holds for large enough n), the last display is at most exp(— ^ log a p n ). The- 
orem 2.9 implies that there is a vector 6q in £obn] with 

$$P_{n,\theta_0^n}\Pi_n(\theta : \|\theta - \theta_0^n\|^2 \le \eta\,p_n\log^a p_n \mid X^n) \to 0$$

for a small enough constant $\eta$. This implies that the corresponding
estimator does not reach the optimal rate over the class $\ell_0[p_n]$ as soon
as $p_n\log^a p_n$ tends to infinity faster than $p_n\log(n/p_n)$ [take, e.g.,
$p_n = n/\exp(\sqrt{\log n})$].
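To get a feeling for the sparsity regime named at the end of Example 2.7, one can tabulate the ratio of the two rates for $p_n = n/\exp(\sqrt{\log n})$. The sketch below is illustrative only; the function name `rate_ratio` and the choice $a = 1.5$ are assumptions made for this example. Since $\log(n/p_n) = \sqrt{\log n}$, the ratio reduces to $\log^a(p_n)/\sqrt{\log n}$.

```python
import math


def rate_ratio(n, a):
    """Ratio p_n log^a(p_n) / (p_n log(n / p_n)) for p_n = n / exp(sqrt(log n))."""
    log_n = math.log(n)
    log_p = log_n - math.sqrt(log_n)       # log p_n
    return log_p ** a / math.sqrt(log_n)   # log(n / p_n) = sqrt(log n)


for n in (1e3, 1e6, 1e12, 1e24):
    print(f"n = {n:.0e}: ratio = {rate_ratio(n, a=1.5):.2f}")
```

The ratio grows without bound as $n$ increases, which is the sense in which the rate $p_n\log^a p_n$ exceeds the minimax rate in this regime.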

2.7. Discussion. We have identified general conditions on the prior that
ensure optimal convergence rates for estimating a sparse mean vector in
Gaussian noise. In particular, natural fully Bayes estimates (e.g.,
Beta-binomial prior on dimension) are shown to be adaptive with respect to the
unknown smoothing parameter $p_n/n$.

Especially in high-dimensional contexts the full posterior measure and 
special aspects of it can start to have divergent behaviors. We have seen that 
for nonconvex distances the posterior mean is not a satisfactory projection. 
It can also happen that the mode and the full posterior behave differently. 

In some situations one might want to estimate prior hyperparameters,
and in this case, it is desirable to assess the convergence properties of the
resulting plug-ins. To our knowledge, there are only a few works in this
direction; see [15, 16]. Potential alternative proofs could consist in first
obtaining (suitably uniform) results for the (full) posterior measure and then
combining them with a statement saying that "the plug-in estimate is not too
bad." Also, here, one could evaluate the sparsity coefficient
$\eta_n = p_n/n$ via the posterior number $\hat k_n$ of selected models and
plug this estimate into the full posterior for the binomial prior on
dimension. Since $\hat\eta_n = \hat k_n/n$ does not exceed $Cp_n/n$ with high
probability, we have some control of the plug-in into the full posterior. The
question of then deriving results for estimates of it (e.g., the mean) remains
open.

3. Implementation. In this section we provide an algorithm to compute
several functionals of the posterior measure associated with the prior defined
by (P1)-(P3), including the posterior mean, marginal posterior quantiles
and the posterior of the number of selected models. The algorithm is exact
in that it does not rely on an approximation of the posterior distribution,
but computes the exact expressions. We illustrate the posterior quantities
through simulations.

We assume that the densities $g_S$ on $\mathbb{R}^S$ are products of $|S|$
copies of a univariate density $g$. Because the prior on the number of nonzero
coordinates induces dependence, this generally does not entail a factorization
of the posterior distribution as a product measure. (An exception is the
binomial distribution for $\pi_n$.)

For all computations, we need the denominator of the posterior measure
in (2.1) (the "partition function"). For $\phi$ the standard normal density,
and $\psi = \phi * g$ its convolution with the density $g$, this can be written

$$Q_n = \sum_{p=0}^n \frac{\pi_n(p)}{\binom{n}{p}}
  \sum_{|S|=p}\,\prod_{i\in S}\psi(X_i)\prod_{i\notin S}\phi(X_i).$$

Naive computation directly from this expression would require a number
of operations that grows exponentially with $n$. However, the sum over all
models $S$ of size $p$ (the inner sum in the display) is equal to the
coefficient of $Z^p$ in the polynomial

$$Z \mapsto \prod_{i=1}^n \bigl(\phi(X_i) + \psi(X_i)Z\bigr).$$

This polynomial can be computed by a quadratic number of operations by
computing the products term by term, and in $n\log^2 n$ operations by a more
clever algorithm.
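The quadratic-time scheme described above amounts to multiplying out the linear factors one at a time. The following is a minimal sketch, not the authors' implementation; the function names `model_sums` and `model_sums_bruteforce` are hypothetical, and the inputs are plain lists holding the values $\phi(X_i)$ and $\psi(X_i)$. The brute-force version enumerates all subsets and is included only to check the polynomial identity on small inputs.

```python
from itertools import combinations
from math import prod


def model_sums(phi_vals, psi_vals):
    """Coefficients c[p] of Z^p in prod_i (phi_i + psi_i * Z).

    c[p] equals the sum over all subsets S of size p of
    prod_{i in S} psi_i * prod_{i not in S} phi_i.
    Uses O(n^2) operations.
    """
    coeffs = [1.0]
    for phi_i, psi_i in zip(phi_vals, psi_vals):
        new = [0.0] * (len(coeffs) + 1)
        for p, c in enumerate(coeffs):
            new[p] += c * phi_i       # coordinate i left out of the model
            new[p + 1] += c * psi_i   # coordinate i included in the model
        coeffs = new
    return coeffs


def model_sums_bruteforce(phi_vals, psi_vals):
    """Same sums by explicit enumeration of subsets (exponential cost)."""
    n = len(phi_vals)
    out = []
    for p in range(n + 1):
        total = 0.0
        for subset in combinations(range(n), p):
            chosen = set(subset)
            total += prod(
                psi_vals[i] if i in chosen else phi_vals[i] for i in range(n)
            )
        out.append(total)
    return out
```

Given these coefficients, the partition function is obtained by weighting `c[p]` with $\pi_n(p)/\binom{n}{p}$ and summing over $p$.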

3.1. Posterior mean. The posterior mean
$\hat\theta^{PM} = \int \theta\,d\Pi_n(\theta \mid X)$ is a random vector in
$\mathbb{R}^n$. Letting $\zeta(x) = \int t\,\phi(x - t)g(t)\,dt$, we can write
its first coordinate in the form

$$\hat\theta_1^{PM} = \frac{1}{Q_n}\sum_{p=1}^n \frac{\pi_n(p)}{\binom{n}{p}}
  \sum_{|S|=p,\,1\in S} \zeta(X_1)\prod_{i\in S,\,i\ne 1}\psi(X_i)\prod_{i\notin S}\phi(X_i).$$

The inner sum (over $S$) is the coefficient of $Z^p$ in the polynomial
$Z \mapsto \zeta(X_1)Z\prod_{i=2}^n(\phi(X_i) + \psi(X_i)Z)$. Hence it can be
computed as before.

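3.2. Coordinatewise quantiles. The distribution function of the marginal
posterior distribution of the first coordinate can be written, for any real
$u$,

$$\Pi_n((-\infty, u] \times \mathbb{R}^{n-1} \mid X)
  = (1 - q_{n,1})1_{u \ge 0} + q_{n,1}\frac{\psi(X_1, u)}{\psi(X_1)},$$

where $1 - q_{n,1}$ is the posterior probability that the first coordinate is
zero, and $\psi(x, u) = \int_{-\infty}^u \phi(x - t)g(t)\,dt$. The former
probability can be written

$$1 - q_{n,1} = \Pr(\theta_1 = 0 \mid X)
  = \frac{1}{Q_n}\sum_{p=0}^{n-1}\frac{\pi_n(p)}{\binom{n}{p}}
  \sum_{|S|=p,\,1\notin S}\,\prod_{i\in S}\psi(X_i)\prod_{i\notin S}\phi(X_i).$$

Hence it can be computed as before, now involving the polynomial
$Z \mapsto \phi(X_1)\prod_{i=2}^n(\phi(X_i) + \psi(X_i)Z)$.

Given the marginal posterior distribution, we can compute marginal
quantiles. For instance, the first component of the coordinatewise median
$\hat\theta^{med}$ is given by, with $H_{n,1}^{-1}$ the inverse of
$H_{n,1}(u) = \psi(X_1, u)/\psi(X_1)$,

$$\hat\theta_1^{med}
  = \Bigl[H_{n,1}^{-1}\Bigl(\frac{2q_{n,1} - 1}{2q_{n,1}}\Bigr) \vee 0\Bigr]
  + \Bigl[H_{n,1}^{-1}\Bigl(\frac{1}{2q_{n,1}}\Bigr) \wedge 0\Bigr].$$

The last display should be understood with the convention
$H_{n,1}^{-1}(u) = -\infty$ if $u \le 0$ and $H_{n,1}^{-1}(u) = \infty$ if
$u \ge 1$.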
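The marginal posterior of a coordinate is a mixture of a point mass at zero (weight $1 - q$) and a continuous part with distribution function $H$ (weight $q$), so its median can be read off from two evaluations of $H^{-1}$ with the conventions just stated. The sketch below illustrates this under assumptions: `mixture_median` is a hypothetical helper, `h_cdf` stands in for $H_{n,1}$ and is assumed continuous and strictly increasing, and the inverse is found by bisection on a bracket `[lo, hi]` that is assumed to contain the required quantiles.

```python
import math


def mixture_median(q, h_cdf, lo=-50.0, hi=50.0, tol=1e-10):
    """Median of (1 - q) * delta_0 + q * H for a continuous increasing cdf H."""

    def h_inv(u):
        # Conventions: H^{-1}(u) = -inf for u <= 0 and +inf for u >= 1.
        if u <= 0.0:
            return -math.inf
        if u >= 1.0:
            return math.inf
        a, b = lo, hi
        while b - a > tol:
            mid = 0.5 * (a + b)
            if h_cdf(mid) < u:
                a = mid
            else:
                b = mid
        return 0.5 * (a + b)

    if q <= 0.0:
        return 0.0  # all posterior mass sits at zero
    upper = max(h_inv(1.0 - 1.0 / (2.0 * q)), 0.0)  # positive-median branch
    lower = min(h_inv(1.0 / (2.0 * q)), 0.0)        # negative-median branch
    return upper + lower
```

At most one of the two branches is nonzero; when $q < 1/2$ both vanish and the median is exactly zero, which is the thresholding effect of the coordinatewise median.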



3.3. Number of nonzero coordinates. The posterior distribution of the
number $|S_\theta|$ of nonzero coordinates of $\theta \in \mathbb{R}^n$ is the
random distribution on the set $\{0, 1, \ldots, n\}$ given by

$$\Pi_n(\theta : |S_\theta| = p \mid X)
  = \frac{1}{Q_n}\frac{\pi_n(p)}{\binom{n}{p}}
  \sum_{|S|=p}\,\prod_{i\in S}\psi(X_i)\prod_{i\notin S}\phi(X_i).$$

The same computational scheme applies. In fact the sum will already be
computed in the derivation of $Q_n$.
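Putting the pieces together, the posterior on the number of nonzero coordinates can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the Laplace density $(a/2)e^{-a|t|}$ is taken for $g$, for which completing the square gives the closed form $\psi(x) = (a/2)e^{a^2/2}\{e^{-ax}\Phi(x-a) + e^{ax}\Phi(-x-a)\}$; the names `psi_laplace` and `dimension_posterior` are hypothetical; and the prior on dimension is passed as unnormalized log-weights. The direct (non-logarithmic) arithmetic is adequate only for small $n$.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def psi_laplace(x, a=1.0):
    """Convolution of the N(0,1) density with the Laplace density (a/2)exp(-a|t|)."""
    return 0.5 * a * math.exp(0.5 * a * a) * (
        math.exp(-a * x) * normal_cdf(x - a)
        + math.exp(a * x) * normal_cdf(-x - a)
    )


def dimension_posterior(x, log_prior, a=1.0):
    """Posterior probabilities of |S_theta| = p for p = 0..n.

    log_prior[p] is log pi_n(p) up to an additive constant.
    """
    n = len(x)
    coeffs = [1.0]  # coefficients of prod_i (phi(x_i) + psi(x_i) Z)
    for xi in x:
        phi_i, psi_i = normal_pdf(xi), psi_laplace(xi, a)
        new = [0.0] * (len(coeffs) + 1)
        for p, c in enumerate(coeffs):
            new[p] += c * phi_i
            new[p + 1] += c * psi_i
        coeffs = new
    weights = [
        math.exp(log_prior[p] - math.log(math.comb(n, p))) * coeffs[p]
        for p in range(n + 1)
    ]
    q_n = sum(weights)  # partition function, up to the prior's constant
    return [w / q_n for w in weights]
```

The normalizing constant of the prior cancels in the final division, which is why unnormalized log-weights suffice.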



3.4. Simulations. In a small simulation study we considered the prior
defined by (P1)-(P3) with $g$ a Laplace density
$x \mapsto (a/2)e^{-a|x|}$, with scale parameter $a > 0$ and two priors on
dimension, suggested by our theoretical results, given by

$$\pi_n(p) \propto e^{-\kappa p\log(3n/p)}, \tag{3.1}$$

$$\pi_n(p) \propto \binom{2n-p}{n}^{\kappa}. \tag{3.2}$$

Here $\kappa$ is a real parameter, which for both priors quantifies how fast
they decrease to zero with $p$. In the results shown we used $a = 1$ and
$\kappa \in \{0.1, 1\}$.

We simulated signals $\theta = (\theta_1, \ldots, \theta_n)$ of length
$n = 500$, for various settings of the sparsity $p_n = \#(\theta_i \ne 0)$ and
for signals $\theta$ with the nonzero coordinates set equal to a fixed number
$A$. We show the results for $p_n \in \{25, 50, 100\}$ and "signal strength"
$A \in \{3, 4, 5\}$.

Tables 1 and 2 report estimates of the mean square errors
$E\|\hat\theta - \theta\|^2$ and mean absolute deviation errors
$E\|\hat\theta - \theta\|_1$ of eight estimators $\hat\theta$. These
estimates are the average (square) error of 100 estimates
$\hat\theta_1, \ldots, \hat\theta_{100}$ computed from 100 data vectors
simulated independently from model (1.1). The eight estimators include the
posterior means PM1, PM2 and coordinatewise medians PMed1, PMed2 associated
with the two priors $\pi_n$ with $\kappa = 0.1$, the empirical Bayes mean EBM
and median EBMed considered in [16] with a standard Laplace prior, and the
hard-thresholding HT and hard-thresholding-oracle HTO estimators, given by

$$\hat\theta_i^{HT} = X_i 1_{|X_i| > \sqrt{2\log n}}, \qquad
  \hat\theta_i^{HTO} = X_i 1_{|X_i| > \sqrt{2\log(n/p_n)}}.$$

The last estimator uses the "oracle" value of the sparsity parameter $p_n$,
whereas the other seven estimators do not use this value.

Table 1
Average square errors of eight estimators computed on 100 data vectors $X$ of
length $n = 500$ simulated from model (1.1) with
$\theta = (0, 0, \ldots, 0, A, \ldots, A)$, where $p_n$ coordinates are equal to
$A$. In every column the smallest value is printed in bold face. The
estimators are: PM1, PM2: posterior means for two priors $\pi_n$ in (3.1) and
(3.2) and Laplace prior on nonzero coordinates; PMed1, PMed2: coordinatewise
medians for the same priors; EBM, EBMed: empirical Bayes mean and median for
Laplace prior; HT, HTO: hard-thresholding and hard-thresholding-oracle

p_n    |      25       |      50       |      100
A      |   3    4    5 |   3    4    5 |   3    4    5
-------+---------------+---------------+---------------
PM1    | 111   96   94 | 176  165  154 | 267  302  307
PM2    | 106   92   82 | 169  165  152 | 269  280  274
EBM    | 103   96   93 | 166  177  174 | 271  312  319
PMed1  | 129   83   73 | 205  149  130 | 255  279  283
PMed2  | 125   86   68 | 187  148  129 | 273  254  245
EBMed  | 110   81   72 | 162  148  142 | 255  294  300
HT     | 175  142   70 | 339  284  135 | 676  564  252
HTO    | 136   92   84 | 206  159  139 | 306  261  245
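The simulation design and the two hard-thresholding benchmarks HT and HTO used in this section are simple enough to sketch directly. The code below is an illustration under assumptions, not the authors' simulation code: the function names are hypothetical, the thresholds are taken to be $\sqrt{2\log n}$ for HT and $\sqrt{2\log(n/p_n)}$ for HTO, and a single data vector is drawn rather than the 100 replications behind the tables.

```python
import math
import random


def simulate(n, p_n, amplitude, rng):
    """theta has its last p_n coordinates equal to `amplitude`; X = theta + N(0,1) noise."""
    theta = [0.0] * (n - p_n) + [float(amplitude)] * p_n
    x = [t + rng.gauss(0.0, 1.0) for t in theta]
    return theta, x


def hard_threshold(x, threshold):
    """Keep observations exceeding the threshold in absolute value, zero the rest."""
    return [xi if abs(xi) > threshold else 0.0 for xi in x]


def ht(x):
    """Hard thresholding at the universal level sqrt(2 log n)."""
    return hard_threshold(x, math.sqrt(2.0 * math.log(len(x))))


def hto(x, p_n):
    """Oracle hard thresholding at sqrt(2 log(n / p_n)); needs the true sparsity p_n."""
    return hard_threshold(x, math.sqrt(2.0 * math.log(len(x) / p_n)))


def squared_error(est, theta):
    return sum((e - t) ** 2 for e, t in zip(est, theta))


rng = random.Random(0)
theta, x = simulate(n=500, p_n=25, amplitude=5, rng=rng)
print("HT :", squared_error(ht(x), theta))
print("HTO:", squared_error(hto(x, 25), theta))
```

Averaging `squared_error` over independent draws gives Monte Carlo estimates of the kind reported for HT and HTO; the exact figures depend on the random seed and the number of replications.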














Table 2
Average absolute deviation errors of eight estimators computed on 100 data
vectors $X$ of length $n = 500$ simulated from model (1.1) with
$\theta = (0, 0, \ldots, 0, A, \ldots, A)$, where $p_n$ coordinates are equal to
$A$. In every column the smallest value is printed in bold face. The priors
and estimators are as in Table 1

p_n    |      25       |      50       |      100
A      |   3    4    5 |   3    4    5 |   3    4    5
-------+---------------+---------------+---------------
PM1    |  80  101  110 | 127  145  147 | 240  268  270
PM2    |  79   85   87 | 135  145  144 | 219  232  232
EBM    |  95  110  117 | 191  200  176 | 260  285  281
PMed1  |  51   43   45 |  86   80   78 | 178  225  230
PMed2  |  50   40   37 |  86   79   76 | 156  162  163
EBMed  |  50   48   45 | 108  121   97 | 212  258  257
HT     |  63   44   27 | 122   86   53 | 244  173  102
HTO    |  53   41   40 |  91   79   74 | 157  148  144

The tables show that the mean and median of the full Bayesian posterior
distribution are competitive with the empirical Bayes estimates. The
behavior of the full Bayes and empirical Bayes estimates seems similar, up to
a few aspects. In terms of squared risk, empirical Bayes estimates appear to
be slightly better for small $p_n$ and small $A$, while the full Bayes
estimates appear to be slightly better for larger signals and larger $p_n$.
For $L_1$-risk, the full Bayes estimates appear to outperform the
EB-estimates in most of the cases. (Additional simulation results, not shown,
suggest that the situation becomes less unfavorable for empirical Bayes as the
scale parameter $a$ of the Laplace prior is taken smaller than 1.) In
agreement with [16], in most cases the mean estimates perform not quite as
well as the median ones, already in terms of squared risk.

The parameter $a$ of the Laplace prior plays the same role for the full
Bayes as for the empirical Bayes estimates. Although we do not investigate
this aspect here, it could be estimated from the data, as is proposed in the
EbayesThresh package, or be treated as a hyperparameter in a full Bayes
approach. [A single scale parameter for high-dimensional densities $g_S$
appears to create dependence between the coordinates that is stronger than
what is allowed by our conditions (2.5) and (2.6), and hence would need
further analysis.] Similar remarks pertain to the parameter $\kappa$. The
choice $\kappa = 0.1$ seemed to be fairly good uniformly over all considered
simulations, also for smaller $n$'s.

For further illustration Figure 1 shows marginal 95% credible intervals
(orange bars) for the parameters $\theta_1, \ldots, \theta_n$, and marginal
posterior medians (red dots) for a single simulation of the data vector, with
signal strength $A = 5$, $p_n = 100$ and $n = 500$. The observations
$X_1, \ldots, X_n$ are indicated by green dots. The credible intervals are
defined as intervals between the 2.5% and 97.5% quantiles of the marginal
posterior distributions of the parameters. The intervals corresponding to
zero and nonzero coefficients $\theta_i$ are clearly separated, although some
of the credible intervals of nonzero $\theta_i$ contain the value zero. Also
visible is that the posterior medians and the credible intervals surrounding
them are shrunk toward zero relative to the observed value $X_i$, for the zero
coordinates $\theta_i$, which is desirable, but also for the nonzero
$\theta_i$. Figure 1 (bottom) shows that for $\kappa = 1$ the shrinkage
effects are stronger, and the credible intervals shorter.

Since our main goal here is illustration, we only implemented a simple
version of the algorithm. This computes the polynomials with direct loops
and can be improved. This implementation is limited to $n$ of the order 500,
not by computing time, but by the appearance of very large and very small
numbers in the polynomial coefficients that fall outside the standard
floating-point range $(10^{-300}, 10^{300})$. Handling larger $n$ should
certainly be possible by improved programming, for instance, by computing on a
logarithmic scale. Algorithmic complexity appears not to be a major issue.

Fig. 1. Marginal posterior medians (red dots) and marginal credible intervals
(orange) for the parameters $\theta_1, \ldots, \theta_n$ for a single data
vector $X_1, \ldots, X_n$ simulated according to the model (1.1) with
$\theta = (0, 0, \ldots, 0, 5, \ldots, 5)$, where $n = 500$ and the last
$p_n = 100$ coordinates are nonzero. The data points are indicated by green
dots. The prior $g$ is the standard Laplace density, and $\pi_n$ is as in
(3.2) with "inverse temperature" $\kappa_1 = 0.1$ (TOP graph) and
$\kappa_1 = 1$ (BOTTOM graph).
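The suggestion of computing on a logarithmic scale can be sketched as follows. This is one possible approach under assumptions, not the authors' implementation: the names `log_add` and `log_model_sums` are hypothetical, and the inputs are the logarithms $\log\phi(X_i)$ and $\log\psi(X_i)$. Each polynomial coefficient is stored as its logarithm and additions are carried out with the standard log-sum-exp device, so that coefficients far outside the floating-point range remain representable.

```python
import math


def log_add(u, v):
    """Return log(exp(u) + exp(v)) without overflow or underflow."""
    if u == -math.inf:
        return v
    if v == -math.inf:
        return u
    m = max(u, v)
    return m + math.log1p(math.exp(-abs(u - v)))


def log_model_sums(log_phi, log_psi):
    """Logs of the coefficients of prod_i (phi_i + psi_i * Z).

    log_phi[i] and log_psi[i] are log phi(X_i) and log psi(X_i).
    """
    log_c = [0.0]  # log of the constant polynomial 1
    for lphi, lpsi in zip(log_phi, log_psi):
        new = [-math.inf] * (len(log_c) + 1)
        for p, lc in enumerate(log_c):
            new[p] = log_add(new[p], lc + lphi)          # coordinate excluded
            new[p + 1] = log_add(new[p + 1], lc + lpsi)  # coordinate included
        log_c = new
    return log_c
```

Posterior probabilities are then obtained by adding the log prior weights, subtracting the largest entry, and exponentiating, which keeps all intermediate quantities in range for $n$ well beyond 500.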



4. Proof of Theorem 2.1. We first prove the theorem for priors on
dimension $\pi_n(p)$ with strict exponential decrease and densities $g_S$ that
are not necessarily of product form, but that satisfy (2.7), for $Dm_1 < 1$,
and $D$ the constant in (2.2). Thus the proof also covers half of Theorem 2.4.
In view of Example 2.5, densities of the product form satisfy (2.7) with
$m_1 = 1$, and hence automatically have $Dm_1 < 1$.

Since the true parameter $\theta_0$ is assumed to have $p_n$ nonzero
coordinates, it is sufficient to prove that the intersection of the support
$S_\theta$ with the complement $S_0^c$ of the support $S_0 = S_{\theta_0}$ of
$\theta_0$ has dimension of the order $p_n$ under the posterior distribution.
The following proposition gives an explicit bound on this dimension; it is
followed by a lemma that shows that this bound tends to zero under the
conditions of the theorems. The idea of the proof of the proposition is to
condition on the vector of the coordinates $\pi_{S_0}\theta$ of $\theta$ that
belong to $S_0$.

The unconditional density of $(S_\theta, \theta)$ for $\theta$ drawn from the
prior $\Pi_n$ is given by, with $\delta_0$ denoting a "Dirac density at 0,"

$$(S, \theta) \mapsto \frac{\pi_n(|S|)}{\binom{n}{|S|}}\,g_S(\theta_S)\,\delta_0(\theta_{S^c}).$$

The conditional density of $(S_\theta \cap S_0^c, \theta_{S_0^c})$ given
$\theta_{S_0}$ is proportional to this expression viewed as function of
$(S \cap S_0^c, \theta_{S\cap S_0^c})$. This shows the conditional distribution
has the same structure as the prior $\Pi_n$, but with sample space
$\mathbb{R}^{S_0^c}$ rather than $\mathbb{R}^n$, with the density of the
nonzero coordinates of $\theta_{S_0^c}$ given by
$g_{S\cap S_0^c \mid S\cap S_0}(\cdot \mid \theta_{S\cap S_0})$, proportional to
$g_{S\cap S_0^c,\,S\cap S_0}(\cdot, \theta_{S\cap S_0})$, and the prior on
dimension given by

$$\pi_{n,k}(p) \propto \pi_n(p+k)\,\frac{\binom{n-p_n}{p}}{\binom{n}{p+k}},
  \qquad k = |S_\theta \cap S_0|. \tag{4.1}$$

The extra factor (quotient) on the right arises because $\pi_{n,k}(p)$ and
$\pi_n(p+k)$ are the probabilities of the given dimensions, and hence the sums
of the probabilities of all subsets of that dimension. Recall also that we
assume that $\pi_n(p)$ is positive for any $p$, which makes the maximum
appearing in the following proposition always finite.

Proposition 4.1. If the densities $g_S$ satisfy (2.7), then, for any
$A \ge 1$,

$$\sup_{\theta_0\in\ell_0[p_n]} P_{n,\theta_0}\Pi_n(\theta : |S_\theta \cap S_0^c| \ge A \mid X)
  \le \sum_{p\ge A} m_1^{p_n+p}\max_{0\le k\le p_n}\frac{\pi_{n,k}(p)}{\pi_{n,k}(0)}.$$

Proof. For $B = \{\theta : |S_\theta \cap S_0^c| \ge A\}$ and
$\Pi_n^{S_0}(\cdot \mid X)$ the marginal distribution of $\theta_{S_0}$ if
$\theta$ is distributed according to the posterior distribution,

$$\Pi_n(B \mid X) = \int \Pi_n(B \mid X, \theta_{S_0} = \theta_1)\,d\Pi_n^{S_0}(\theta_1 \mid X)
  \le \sup_{\theta_1}\Pi_n(B \mid X, \theta_{S_0} = \theta_1).$$

In the Bayesian setting the vectors $X_{S_0}$ and $X_{S_0^c}$ are
conditionally independent given $\theta$ with marginal conditional
distributions depending on $\theta_{S_0}$ and $\theta_{S_0^c}$ only,
respectively. This implies that the distribution of $\theta_{S_0^c}$ given
$(X, \theta_{S_0})$ depends on $(X_{S_0^c}, \theta_{S_0})$ only. The joint
distribution of $(X_{S_0^c}, \theta_{S_0^c}, \theta_{S_0})$ can be generated by
first generating $\theta_{S_0}$ from its marginal distribution derived from
$\Pi_n$, next generating $\theta_{S_0^c}$ from its conditional given
$\theta_{S_0}$ derived from $\Pi_n$, and finally generating $X_{S_0^c}$ from
the $N_{n-p_n}(\theta_{S_0^c}, I)$-distribution. It follows that the
conditional distribution of $\theta_{S_0^c}$ given $(X, \theta_{S_0})$ can also
be described as the "ordinary" posterior distribution of $\theta_{S_0^c}$ given
the observation $X_{S_0^c}$ relative to the prior on $\theta_{S_0^c}$ given by
the conditional distribution of $\theta_{S_0^c}$ given $\theta_{S_0}$ derived
from $\Pi_n$. If $\Pi_n(\cdot \mid \theta_1)$ denotes the prior induced on
$\mathbb{R}^{S_0^c}$ when conditioning $\Pi_n$ to the event that
$\theta_{S_0} = \theta_1$, and $n_2 = n - p_n$, then

$$\Pi_n(B \mid X, \theta_{S_0} = \theta_1)
  = \frac{\int_B p_{n_2,\theta_2}(X_{S_0^c})\,d\Pi_n(\theta_2 \mid \theta_1)}
         {\int p_{n_2,\theta_2}(X_{S_0^c})\,d\Pi_n(\theta_2 \mid \theta_1)}. \tag{4.2}$$

The denominator of the right-hand side can be bounded below by restricting
the integrating set to the singleton $\{\theta_2 = 0\}$, leading to

$$\int p_{n_2,\theta_2}(X_{S_0^c})\,d\Pi_n(\theta_2 \mid \theta_1)
  \ge \Pi_n(\theta_2 = 0 \mid \theta_1)\,p_{n_2,0}(X_{S_0^c}).$$

Let S 2 denote the indices of the nonzero coordinates of 9 2 G 9 2 the 
vector of their values and n 2 = \ S 2 \, and similarly for Si , 9\ . Then 

n n ( J B|x,^ = ^i)<n n (^ = o|ei)" 1 f ^^(Xs^dUM^) 

./R Pno.n„r 



With the notation Si,9±,9 2 introduced above, one obtains 

/ ^(X ss )dU n (9 2 \9 1 ,S 2 )= [ ^(X S2 ) gS ^fl d9 2 . 

J Pn 2 ,0 s c J Pn 2 fls 2 J 9S 1 ,S 2 [Pl,9 2 )d9 2 

On the other hand, an application of Bayes's formula leads to 

n n (5 2 |gi) iL n (s 1 ,s 2 ) r g Sl ,s 2 {Qi,e 2 ) 

U n {S 2 = 0\9 l ) U n (S 1 ,S 2 = 0)J gsM 2 ' 
and the last ratio of prior probabilities of subsets is equal to 

n n (5i,S 2 ) _ ir n (p + k) (%) _ n n , k (p) 1_ 

U n (S 1 ,S 2 = 0) ~ { p l k ) Mk) ~ vr n , fc (0) ' 






Combining the previous identities and condition (2.7), one obtains that 
H n (B\X,8s = 9\) is bounded above, uniformly in 9\ (S\,9i), by 



The proposition follows, since P n ,8oPn2,e 2 /Pn 2 ,Os 2 ( x s 2 ) = 1 - a 

Lemma 4.1. If $\pi_n$ satisfies (2.2) with $C = 0$ and a constant $D$ such
that $m_1 D < 1$, then
$\sum_{p=P_n}^{n-p_n} m_1^{p_n+p}\max_k[\pi_{n,k}(p)/\pi_{n,k}(0)] \to 0$ for
$P_n$ bigger than a sufficiently large multiple of $p_n$ and
$P_n \to \infty$.

Proof. From the expression of ix n ^ in (4.1), simple algebra leads to 



Using the assumed strict exponential decrease, the second ratio in the last 



M > as soon as x is iarger than a sufficiently large multiple of M, the 
result follows. □ 

Combining Proposition 4.1 and Lemma 4.1 concludes the proof of the first 
half of Theorem 2.4 and of Theorem 2.1 for priors on dimension with strict 
exponential decrease. 

For $g_S$ of the product form and $\pi_n$ with just exponential decrease
[$C > 0$ in (2.2)] such as the oracle binomial prior, we use a slight variant
of the above argument. Starting from (4.2), the denominator can be bounded
below with the help of Lemma 5.2 (below), applied with $n_2$ instead of $n$,
with $\theta_0 = 0$ and both $\Pi = \tilde\Pi = \Pi_n(\cdot \mid \theta_1)$.
This implies that $\Pi_n(B \mid X, \theta_{S_0} = \theta_1)$ is bounded above by



where $\mu_2 = \int \theta_2\,d\Pi_n(\theta_2 \mid \theta_1)$ and
$\sigma_2^2 = \int \|\theta_2\|^2\,d\Pi_n(\theta_2 \mid \theta_1)$. In fact
$\mu_2 = 0$, by the assumption that the common density $g$ has zero mean. If
$m_2$ denotes the second moment of $g$, we have



This implies that $\Pi_n(B \mid X, \theta_{S_0} = \theta_1)$ is uniformly
bounded in $\theta_1$ by

n-p„ 

p=A |5 2 |=p ^ P ' ^ n 2,u Sa 

To conclude one takes the P n! g -expectation and uses Lemma 4.2 below. 

Lemma 4.2. // vr n satisfies (2.2), then <m2D\p n with D\ that de- 
pends on C,D in (2.2) only. Furthermore, Y^j=^ n max -k{'^n,k{p)& l ' k ) for 
P n bigger than a sufficiently large multiple of p n and P n — > oo. 

5. Proof of Theorems 2.2 and 2.4. In view of Theorem 2.1 the posterior
mass of models of dimension bigger than $Ap_n$, for a large constant $A$, tends
to zero. Thus it suffices to show concentration around $\theta_0$ in models
with $|S_\theta| \le Ap_n$. This is achieved using testing arguments.
Proposition 5.1 gives an explicit bound on concentration with respect to the
Euclidean metric. General $d_q$-metrics are next treated by interpolation of
metrics.

Let $\Phi$ be the standard normal distribution function and
$\bar\Phi = 1 - \Phi$.



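Lemma 5.1. For any $\alpha, \beta > 0$ and any
$\theta_0, \theta_1 \in \mathbb{R}^n$ there exists a test $\phi$ based on
$X \sim N(\theta, I)$, such that for every $\theta \in \mathbb{R}^n$ with
$\|\theta - \theta_1\| \le \|\theta_0 - \theta_1\|/2 = \rho$,

$$\alpha P_{n,\theta_0}\phi + \beta P_{n,\theta}(1 - \phi)
  \le \alpha\bar\Phi\Bigl(\frac{\rho}{2} + \frac{1}{\rho}\log\frac{\alpha}{\beta}\Bigr)
  + \beta\Phi\Bigl(-\frac{\rho}{2} + \frac{1}{\rho}\log\frac{\alpha}{\beta}\Bigr).$$

This quantity can be further bounded by
$2\sqrt{\alpha\beta}\,e^{-\|\theta_0 - \theta_1\|^2/32}$.

We note that the bound of Lemma 5.1, even though valid for every
$\alpha, \beta > 0$, is of interest only if $\alpha$ and $\beta$ are not too
different: if $\log(\alpha/\beta) < -\|\theta_0 - \theta_1\|^2/32$ or
$\log(\alpha/\beta) > \|\theta_0 - \theta_1\|^2/32$, then the trivial tests
$\phi = 1$ and $\phi = 0$ give the better bounds $\alpha$ and $\beta$,
respectively.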



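Lemma 5.2. For any prior probability distribution $\Pi$ on $\mathbb{R}^n$,
any positive measure $\tilde\Pi$ with $\tilde\Pi \le \Pi$, and any
$\theta_0 \in \mathbb{R}^n$,

$$\int \frac{p_{n,\theta}}{p_{n,\theta_0}}(X)\,d\Pi(\theta)
  \ge \|\tilde\Pi\|\,e^{-\sigma^2/2 + \tilde\mu^T(X - \theta_0)},$$

where $\tilde\mu = \int(\theta - \theta_0)\,d\tilde\Pi(\theta)/\|\tilde\Pi\|$
and $\sigma^2 = \int\|\theta - \theta_0\|^2\,d\tilde\Pi(\theta)/\|\tilde\Pi\|$.
Consequently, for any $r > 0$,

$$P_{n,\theta_0}\Bigl(\int \frac{p_{n,\theta}}{p_{n,\theta_0}}(X)\,d\Pi(\theta)
  \ge e^{-r^2}\,\Pi(\theta : \|\theta - \theta_0\| < r)\Bigr) \ge 1 - e^{-r^2/8}.$$

Lemma 5.3. The volume $v_p$ of the $p$-dimensional Euclidean unit ball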
satisfies, for every p>l, setting d\ = l/y/n and d 2 = e}l^d\, 

d^eitf'^-v/ 2 - 1 / 2 <v p < d 2 {2e^yl 2 p-^ 2 - 1 / 2 . 

Lemma 5.4. Let S C {1, . . . ,n}, p= \S\, j > 1 andr 2 > p„Vlog7r n (p n ) _1 . 
Let 6s j G M n with support S and 2jr n < \\9s,j — #o|| < 2(j + l)r n . For some 
universal constant C3 > 0, we /iai>e that 



log 



11(0 G R« : S e = S, \\ir s 9 - 9 S J < jr n ) 



e~<U(9 et", \\9-9 1| <r n ) 
< c 3 (p + P„) +plogj + 9(j + l) 2 r 2 /64 + 7r 2 /2. 

Proof. Denoting /3s j the quantity in the logarithm in the last display, 
n(S)G s (9eR s :\\9-TT S 9 s J < jr n ) 



< 



e- r2 nU(S )G Sa (9 € : \\0 - nSo e \\ < r n ) 
U{S)v s (jrJ s \ max(<7 S (0) : \\9 - 7r s 9 s J < jr n ) 



e-^U(S )v So rl S ° l min( 5S o(0) : ||0 - n So 9 Q \\ < r„) ' 
Let us decompose, for any 9' G M" 5 and 9 G M 5 ° , 

gs(gO = 9s{9') gsnSo(nsns 9') 9sns {^sns 9) 
9S (9) gsns (nsns 9') gsnS (^SnS 9) 9s (9) 
Combining this identity with (2.5) and (2.6), we obtain, with c 2 = 1/64, 



, 9s(9') 
log- 



9S (9) 



< Cl |5|+ Cl |5n5o|+ Cl |5o| 



+ C2\Ws-S 9'\f + c 2 ||vr Sn 5 (6' / - 9)\\ 2 + c 2 ||vr 5o _ s 6'|| 2 . 

Denoting by 9,9' the vectors of R n with respective supports So,S and such 
that 7rs 9 = 9, 7Ts9' = 9', note that the last line of the previous display is 
bounded above by C2||#' — #|| 2 . For \\9' — ns9s,j\\ < jr n and \\9 — tts q 9q\\ < r n , 
we have 

\\9' - 9\\ <\\9'- 9 s ,j\\ + \\9 s ,j - 9 \\ + \\9 - 9\\ < 3(j + l)r n . 
Due to Lemma 5.3, the quotient v p rn/(v p r^™) is bounded by 

Since $r_n^2 \ge p_n$ by assumption, we have
$(\sqrt{p_n}/r_n)^{p_n} \le 1$; and because the function
$p \mapsto p\log(r_n^2/p)$ takes a maximum at $p = r_n^2/e$, we obtain, for
some universal constants $C, C'$,

$$\beta_{S,j} \le j^p\,e^{Cp + C'p_n + 9c_2(j+1)^2r_n^2 + (1 + 1/(2e))r_n^2}\,\Pi(S)/\Pi(S_0).$$

To conclude, one notes that $\Pi(S) \le 1$ and that
$\binom{n}{p_n} \le (ne/p_n)^{p_n} \le e^{r_n^2 + p_n}$ by the assumption on
$r_n$, so that $\Pi(S_0) \ge e^{-2r_n^2 - p_n}$. □

Proposition 5.1. If the densities $g_S$ satisfy (2.5) and (2.6) and have finite second moments, then there exist universal constants $d_1, d_2, d_3$ such that for $M \ge 10$ and $1 \le A \le n/(2p_n)$ and $r_n$ satisfying (2.4) and $p_n/n \to 0$, as $n \to +\infty$,



Proof. Let $\mathcal{S}_1$ be the collection of subsets $S \subset \{1,2,\ldots,n\}$ such that $|S| \le Ap_n$. For each such $S$ and $j = 1, 2, \ldots$ let $\{\theta_{S,j,i} : i \in I_{S,j}\}$ be a maximal $jr_n$-separated set inside the set $\{\theta \in \mathbb{R}^n : S_\theta = S,\ 2jr_n \le \|\theta - \theta_0\| \le 2(j+1)r_n\}$. Because the latter set is within a ball of radius $2(j+1)r_n$ of the projection $\pi_S\theta_0$ onto the subspace of vectors with support inside $S$, a volume argument shows that the cardinality of $I_{S,j}$ is at most $9^{|S|}$.
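One standard way to run such a volume argument (assumed here, since the text does not spell it out) is to note that balls of radius $jr_n/2$ around the points of a $jr_n$-separated set are disjoint and lie inside a ball of radius $2(j+1)r_n + jr_n/2$. The ratio of radii, raised to the power $|S|$, then bounds the cardinality. The hypothetical helper `packing_base` below computes that ratio, in units of $r_n$, and confirms it never exceeds 9.

```python
def packing_base(j):
    """Ratio (R + d/2) / (d/2) with R = 2(j+1) and d = j, in units of r_n."""
    radius = 2 * (j + 1)  # radius of the enclosing ball
    sep = j               # separation of the points
    return (radius + sep / 2) / (sep / 2)

# The ratio equals 5 + 4/j, so it is largest at j = 1.
bases = [packing_base(j) for j in range(1, 1001)]
print(max(bases), bases[0])
```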

We can partition the set of vectors with support exactly $S$ by assigning each such vector to a closest point $\theta_{S,j,i}$ for some $j = 1, 2, \ldots$, and $i \in I_{S,j}$. The resulting partitioning sets $B_{S,j,i}$ will fit into balls of radius $jr_n$. For each $\theta_{S,j,i}$ fix a test $\phi_{S,j,i}$ as in Lemma 5.1 with $\alpha = 1$ and the triple $(\theta_0, \theta_1)$, $\rho$ and $\beta$ taken equal to the triple $(\theta_0, \theta_{S,j,i})$, $jr_n$ and $\beta_{S,j,i}$, where the last numbers will be determined later. In view of the second assertion of Lemma 5.2 applied with $r$ equal to $r_n$, there exist events $A_n$ such that $P_{n,\theta_0}(A_n^c) \le e^{-r_n^2/8}$, on which



$$\sup P_{n,\theta_0}\Pi_n(\theta : \|\theta - \theta_0\| > Mr_n,\ |S_\theta| \le Ap_n \mid X)$$





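We have that

$$P_{n,\theta_0}\Pi_n(\theta : \|\theta - \theta_0\| > 2Mr_n,\ S_\theta \in \mathcal{S}_1 \mid X)1_{A_n} \le \sum_{S\in\mathcal{S}_1}\sum_{j\ge M}\sum_{i\in I_{S,j}} P_{n,\theta_0}\Pi_n(\theta \in B_{S,j,i} \mid X)1_{A_n}$$

$$\le \sum_{S\in\mathcal{S}_1}\sum_{j\ge M}\sum_{i\in I_{S,j}} \Bigl[P_{n,\theta_0}\phi_{S,j,i} + \beta_{S,j,i}\sup_{\theta\in B_{S,j,i}} P_{n,\theta}(1-\phi_{S,j,i})\Bigr],$$

where we have denoted

$$\beta_{S,j,i} = \frac{\Pi_n(B_{S,j,i})}{e^{-r_n^2}\,\Pi_n(\theta : \|\theta - \theta_0\| < r_n)}.$$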
In view of Lemma 5.1 the term within the triple sum is bounded by $2\sqrt{\beta_{S,j,i}}\,e^{-j^2r_n^2/8}$. Since $|S| = p \le Ap_n$ and $p_n/n \to 0$, we can take $n$ large enough in order to have both $c_3(p + p_n) \le r_n^2/10$ and $p\log j \le j^2r_n^2/100$ for any $j \ge 1$. Since $M \ge 10$, we have $j \ge 10$, so we also have $r_n^2 \le j^2r_n^2/100$. Combination with Lemma 5.4 now yields the bound, for $j \ge 10$,

$$\log\sqrt{\beta_{S,j,i}} \le 2.3\,j^2r_n^2/100 + 9(j+1)^2r_n^2/128.$$

One easily checks that this is bounded by $(1 - d_2)j^2r_n^2/8$, for $d_2 = 1/9$ when $j \ge 10$. Thus the probability at stake is bounded from above by




for $d_1$ large enough. By assumption $Ap_n \le n/2$, so each binomial term is bounded by the last one. Using simple algebra this yields the second term in the bound of the theorem. The first term comes from $P_{n,\theta_0}(A_n^c) \le e^{-r_n^2/8}$. □
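The numerical constants in this proof can be verified with exact rational arithmetic. The following sketch (the names `coef`, `lhs` and `rhs` are hypothetical) first recovers the coefficient $2.3/100$ by halving the bound of Lemma 5.4 and inserting the three inequalities valid for $j \ge 10$, and then checks the claim that $2.3j^2/100 + 9(j+1)^2/128 \le (1 - 1/9)j^2/8$ over a finite range of $j$. The common factor $r_n^2$ is dropped throughout.

```python
from fractions import Fraction as F

# Half of the bound of Lemma 5.4, using for j >= 10:
#   c_3 (p + p_n) <= r_n^2 / 10 <= j^2 r_n^2 / 1000,
#   p log j       <= j^2 r_n^2 / 100,
#   7 r_n^2 / 2   <= 7 j^2 r_n^2 / 200.
coef = F(1, 2) * (F(1, 1000) + F(1, 100) + F(7, 200))
print(coef)  # coefficient of j^2 r_n^2

def lhs(j):
    """2.3 j^2 / 100 + 9 (j + 1)^2 / 128."""
    return F(23, 1000) * j**2 + F(9, 128) * (j + 1) ** 2

def rhs(j, d2=F(1, 9)):
    """(1 - d_2) j^2 / 8."""
    return (1 - d2) * j**2 / 8

ok = all(lhs(j) <= rhs(j) for j in range(10, 5001))
print(ok)
```

For larger $j$ the inequality only becomes easier, because the ratio of the two sides equals $(0.023 + (9/128)(1 + 1/j)^2)/(1/9)$, which decreases in $j$.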

In view of (2.4) we have $\binom{n}{Ap_n} \le (ne/Ap_n)^{Ap_n} \le e^{d_1r_n^2}$. Therefore, the right-hand side of Proposition 5.1 tends to zero. Combining this with Theorem 2.1 yields proofs of Theorems 2.2 and 2.4 for $d_q$ the square Euclidean norm $d_2$.

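The theorems for $q \in (0, 2)$ are a corollary of the case $q = 2$, by interpolation between the distances. Due to Hölder's inequality, for any $\theta, \theta_0$ with $|S_\theta \cup S_{\theta_0}| \le Ap_n$,

$$d_q(\theta, \theta_0) \le \|\theta - \theta_0\|^q (Ap_n)^{1-q/2}.$$

This implies, for any $M > 0$, if $\theta_0 \in \ell_0[p_n]$,

$$P_{n,\theta_0}\Pi_n(d_q(\theta,\theta_0) > Mr_n^q p_n^{1-q/2} \mid X) \le P_{n,\theta_0}\Pi_n(\theta : |S_\theta| > (A-1)p_n \mid X) + P_{n,\theta_0}\Pi_n(\|\theta - \theta_0\| > M^{1/q}A^{1/2-1/q}r_n \mid X).$$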

The first term on the right-hand side tends to zero for sufficiently large A. 
Next the second tends to zero for sufficiently large M . 
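The Hölder interpolation step can be illustrated on simulated sparse vectors. The sketch below assumes $d_q(\theta,\theta_0) = \sum_i |\theta_i - \theta_{0,i}|^q$; the dimensions, the seed and the helper names `d_q` and `sparse_vector` are hypothetical choices made for illustration. It records the largest observed ratio of $d_q$ to $\|\theta-\theta_0\|^q k^{1-q/2}$, where $k$ is the size of the joint support and plays the role of $Ap_n$.

```python
import random

def d_q(x, y, q):
    """Sum of q-th powers of absolute coordinate differences."""
    return sum(abs(a - b) ** q for a, b in zip(x, y))

def sparse_vector(n, k, rng):
    """Vector in R^n with k nonzero Gaussian entries at random positions."""
    v = [0.0] * n
    for i in rng.sample(range(n), k):
        v[i] = rng.gauss(0.0, 3.0)
    return v

rng = random.Random(0)
n, k = 200, 10
worst = 0.0
for _ in range(500):
    theta, theta0 = sparse_vector(n, k, rng), sparse_vector(n, k, rng)
    support = sum(1 for a, b in zip(theta, theta0) if a != 0.0 or b != 0.0)
    l2 = sum((a - b) ** 2 for a, b in zip(theta, theta0)) ** 0.5
    for q in (0.25, 0.5, 1.0, 1.5):
        bound = l2 ** q * support ** (1 - q / 2)
        worst = max(worst, d_q(theta, theta0, q) / bound)
print(worst)  # should not exceed 1
```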

6. Proof of Theorem 2.6. The theorem is proved by bounding the (posterior) risk under a vector $\theta_0 \in m_s[p_n]$ by the risk under its projection into $\ell_0[p]$ obtained by setting the smallest $n - p$ coordinates of $\theta_0$ equal to zero. The value $p$ that minimizes the expression that defines the rate $r_n^2$ is the optimal dimension of a projection, and the complicated expression itself is a trade-off of an approximation error and a rate.
The comparison between $\theta_0$ and its projection $\theta_1$ is made in the following lemma.

Lemma 6.1. For any measurable function $G$ and any $\theta_0, \theta_1$ in $\mathbb{R}^n$,

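Proof. In view of the Cauchy-Schwarz inequality,

$$P_{n,\theta_0}G \le (P_{n,\theta_1}G^2)^{1/2}\Bigl(P_{n,\theta_1}\Bigl(\frac{p_{n,\theta_0}}{p_{n,\theta_1}}\Bigr)^2\Bigr)^{1/2}.$$

The second integral on the right-hand side is equal to $\exp(\|\theta_0 - \theta_1\|^2)$. □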
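The value of the second integral can be confirmed numerically in dimension one, where it reads $\int p_{\theta_0}(x)^2/p_{\theta_1}(x)\,dx = \exp((\theta_0-\theta_1)^2)$ for unit-variance normal densities; the $n$-dimensional statement follows by taking products over coordinates. The sketch below is a plain midpoint rule. The values of `theta0` and `theta1` and the integration window are hypothetical, and the integrand is evaluated through its logarithm to avoid dividing by an underflowed density.

```python
import math

def log_integrand(x, theta0, theta1):
    """Log of p_{theta0}(x)^2 / p_{theta1}(x) for unit-variance normals."""
    return (-(x - theta0) ** 2 + 0.5 * (x - theta1) ** 2
            - 0.5 * math.log(2 * math.pi))

def second_moment_of_ratio(theta0, theta1, half_width=20.0, steps=200_000):
    """Midpoint rule for the integral of p_{theta0}^2 / p_{theta1}."""
    lo = min(theta0, theta1) - half_width
    hi = max(theta0, theta1) + half_width
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += math.exp(log_integrand(x, theta0, theta1))
    return total * h

theta0, theta1 = 1.0, 0.3  # hypothetical one-dimensional means
value = second_moment_of_ratio(theta0, theta1)
expected = math.exp((theta0 - theta1) ** 2)
print(value, expected)
```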

Let $p^*$ be an index for which the minimum that defines the rate $r_n^2$ is attained. For given $\theta_0$ belonging to $m_s[p_n]$, let $\theta_1$ denote the vector deduced from $\theta_0$ by keeping unchanged its $p^*$ largest components and putting the other ones to 0. By definition $\theta_1$ belongs to $\ell_0[p^*]$ and




where the first inequality is obtained using the definition of the $m_s[p_n]$-class, and the second follows by comparison of the series with an integral. Therefore, the triangle inequality implies

$$\Pi_n(\theta : \|\theta - \theta_0\| > 80r_n + 20r \mid X) \le \Pi_n(\theta : \|\theta - \theta_1\| > 79r_n + 20r \mid X).$$

By Lemma 6.1 the expectation of the right-hand side under $P_{n,\theta_0}$ is bounded by

$$\bigl(P_{n,\theta_1}\Pi_n(\theta : \|\theta - \theta_1\| > 79r_n + 20r \mid X)\bigr)^{1/2}e^{\|\theta_0 - \theta_1\|^2/2}.$$

Finally apply Theorem 2.5, with $r$ of the theorem taken equal to $3.4r_n + 2r$.

7. Proof of Theorems 2.8 and 2.9. The proof of Theorem 2.8 follows the 
approach to get lower bound type results introduced in [8], which uses the 
principle that sets with very little prior mass receive no posterior mass, see 
also Figure 2. 

Lemma 7.1. We have $P_{n,\theta_0}\Pi_n(\theta : \|\theta - \theta_0\| < s_n \mid X) \to 0$, for any $s_n$ for which there exists $r_n$ such that

$$\frac{\Pi_n(\theta : \|\theta - \theta_0\| < s_n)}{\Pi_n(\theta : \|\theta - \theta_0\| < r_n)} = o(e^{-r_n^2}).$$
Lemma 7.2. There exists a constant $C > 0$ such that if $S \subset \{1,\ldots,n\}$ and $r_n$ is a sequence of real numbers such that $r_n^2 \ge |S_{\theta_0}|$, it holds

$$\frac{v_{|S\cap S_{\theta_0}|}}{v_{|S_{\theta_0}|}\,r_n^{|S_{\theta_0}\setminus S|}} \le e^{C|S_{\theta_0}|}.$$

Proof of Theorem 2.8. We first consider the (more complicated) case that $1 \le \alpha < 2$. For this range of $\alpha$ an application of Hölder's inequality gives that $\|\theta\|_\alpha \le \|\theta\|_2\,p^{1/\alpha-1/2}$, if $p$ is the number of nonzero coordinates of a vector $\theta$. Let us introduce

_ ( PoWa a ,\ INI _ P0,a _ r n ( PoWa -1/2-1/c 

" VIWI 2 7 8 ' 64 " 8 

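Then $r_n \le \|\theta_0\|/8$ and $s_n \le r_n/8$. Also,

$$\frac{\Pi_n(\theta : \|\theta - \theta_0\| < s_n)}{\Pi_n(\theta : \|\theta - \theta_0\| < r_n)} \le \sum_S \frac{\Pi_n(S)\,G_{S\cap S_0}(\theta \in \mathbb{R}^{S\cap S_0} : \|\theta - \pi_{S\cap S_0}\theta_0\| < s_n)}{\Pi_n(S_0)\,G_{S_0}(\theta \in \mathbb{R}^{S_0} : \|\theta - \pi_{S_0}\theta_0\| < r_n)}\,1_{\|\pi_{S_0\setminus S}\theta_0\| < s_n}.$$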



Define 
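Then the ball in $\mathbb{R}^{S_0}$ of radius $s_n$ around $\theta_B$ is contained in the ball of radius $r_n$ around $\pi_{S_0}\theta_0$. It follows that the second-to-last display is bounded above by

$$\sum_S \frac{\Pi_n(S)\,s_n^{|S\cap S_0|}\,v_{|S\cap S_0|}\,\sup_{\theta\in A} g_{S\cap S_0}(\theta)}{\Pi_n(S_0)\,s_n^{|S_0|}\,v_{|S_0|}\,\inf_{\theta\in B} g_{S_0}(\theta)}\,1_{\|\pi_{S_0\setminus S}\theta_0\| < s_n}, \tag{7.1}$$

with $A = \{\theta \in \mathbb{R}^{S\cap S_0} : \|\theta - \pi_{S\cap S_0}\theta_0\| < s_n\}$ and $B = \{\theta \in \mathbb{R}^{S_0} : \|\theta - \theta_B\| < s_n\}$. We finish the proof by bounding the densities $g_{S\cap S_0}$ and $g_{S_0}$ above and below on the given sets.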

If $\theta \in B$, then by the triangle inequality followed by Hölder's inequality,

$$\|\theta\|_\alpha \le \|\theta_B\|_\alpha + \|\theta - \theta_B\|_\alpha$$

< I "lif) l|( '» ll " +i '»°" /2s "-( 1 -w) l|S ° IU ' 

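because $s_n \le r_n/8$ and $p_n^{1/\alpha-1/2}s_n \le (r_n/8)\|\theta_0\|_\alpha/\|\theta_0\|$. Similarly, if $\theta \in A$ and $\|\pi_{S_0\setminus S}\theta_0\| < s_n$, then $\|\pi_{S_0\setminus S}\theta_0\|_\alpha \le p_n^{1/\alpha-1/2}s_n$ and

$$\|\theta\|_\alpha \ge \|\theta_0\|_\alpha - \|\theta_0 - \pi_{S\cap S_0}\theta_0\|_\alpha - \|\pi_{S\cap S_0}\theta_0 - \theta\|_\alpha \ge \|\theta_0\|_\alpha - 2p_n^{1/\alpha-1/2}s_n \ge \|\theta_0\|_\alpha\Bigl(1 - \frac{r_n}{4\|\theta_0\|}\Bigr).$$

We deduce that, for any $S$ such that $\|\pi_{S_0\setminus S}\theta_0\| < s_n$, denoting by $c_\alpha$ the normalizing constant of the density $x \mapsto c_\alpha\exp(-|x|^\alpha)$,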



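$$c_\alpha^{|S_0\setminus S|}\,\frac{\sup_{\theta\in A} g_{S\cap S_0}(\theta)}{\inf_{\theta\in B} g_{S_0}(\theta)} \le \exp\Bigl[\Bigl(1 - \frac{3r_n}{4\|\theta_0\|}\Bigr)^{\alpha}\|\theta_0\|_\alpha^\alpha - \Bigl(1 - \frac{r_n}{4\|\theta_0\|}\Bigr)^{\alpha}\|\theta_0\|_\alpha^\alpha\Bigr]$$

$$\le \exp\Bigl[-2\alpha(5/8)^{\alpha-1}\,\frac{r_n}{4\|\theta_0\|}\,\|\theta_0\|_\alpha^\alpha\Bigr] \le \exp\bigl[-4\alpha(5/8)^{\alpha-1}r_n^2\bigr],$$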



where to obtain the second last inequality we have used that for any $0 < t < 1/8$ and $\alpha \ge 1$ it holds $(1-t)^\alpha - (1-3t)^\alpha = \alpha t\int_1^3 (1-ut)^{\alpha-1}\,du \ge 2\alpha t(1-3/8)^{\alpha-1}$. Hence the expression in (7.1) is bounded above by

^ U n (S ) [CaSn) 

llnl^Oj g 

< e - 4a (5/8) Q ~ 1 r ? 2 t e C*p„ g cp„ log (n/p„ ) 

2 

by Lemma 7.2. The right-hand side is of smaller order than $e^{-r_n^2}$. An application of Lemma 7.1 concludes the proof for the case that $1 \le \alpha < 2$.
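The elementary inequality for $(1-t)^\alpha - (1-3t)^\alpha$ used in this part of the proof can be checked numerically. The sketch below (the function names and the grids of $t$ and $\alpha$ are hypothetical) compares the difference with a midpoint-rule evaluation of $\alpha t\int_1^3(1-ut)^{\alpha-1}\,du$ and with the lower bound $2\alpha t(5/8)^{\alpha-1}$.

```python
def gap(t, alpha):
    """(1 - t)^alpha - (1 - 3t)^alpha."""
    return (1 - t) ** alpha - (1 - 3 * t) ** alpha

def integral_form(t, alpha, steps=20_000):
    """alpha * t * integral over [1, 3] of (1 - u t)^(alpha - 1), midpoint rule."""
    h = 2.0 / steps
    s = sum((1 - (1 + (k + 0.5) * h) * t) ** (alpha - 1) for k in range(steps))
    return alpha * t * s * h

def lower_bound(t, alpha):
    """2 * alpha * t * (1 - 3/8)^(alpha - 1)."""
    return 2 * alpha * t * (1 - 3 / 8) ** (alpha - 1)

ts = [k / 80 for k in range(1, 10)]      # grid in (0, 1/8)
alphas = [1.0, 1.25, 1.5, 1.75, 2.0]
identity_ok = all(abs(gap(t, a) - integral_form(t, a)) < 1e-8
                  for t in ts for a in alphas)
bound_ok = all(gap(t, a) >= lower_bound(t, a) - 1e-12
               for t in ts for a in alphas)
print(identity_ok, bound_ok)
```

At $\alpha = 1$ the bound holds with equality, which is why a small floating-point tolerance is used in the comparison.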
The proof in the case that $\alpha \ge 2$ follows the same lines, except that we use the inequality $\|\theta\|_\alpha \le \|\theta\|_2$ for every $\theta \in \mathbb{R}^p$, without the factor $p^{1/\alpha-1/2}$ that is necessary if $\alpha < 2$. We define $s_n = (r_n/8)\|\theta_0\|_\alpha/\|\theta_0\|$. □
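The two norm comparisons used in the proof, with the factor $p^{1/\alpha-1/2}$ for $1 \le \alpha < 2$ and without it for $\alpha \ge 2$, can be illustrated on random vectors. The sketch below is a simulation under hypothetical choices of dimension, seed and exponents; `norm` is an illustrative helper and not notation from the paper.

```python
import random

def norm(v, a):
    """The l_a norm of the vector v."""
    return sum(abs(x) ** a for x in v) ** (1.0 / a)

rng = random.Random(1)
worst_small, worst_large = 0.0, 0.0
for _ in range(1000):
    p = rng.randint(1, 30)
    v = [rng.gauss(0.0, 2.0) for _ in range(p)]
    for a in (1.0, 1.3, 1.7):
        # ratio of ||v||_a to ||v||_2 * p^(1/a - 1/2), expected <= 1
        ratio = norm(v, a) / (norm(v, 2) * p ** (1 / a - 0.5))
        worst_small = max(worst_small, ratio)
    for a in (2.0, 3.0, 6.0):
        # ratio of ||v||_a to ||v||_2, expected <= 1
        worst_large = max(worst_large, norm(v, a) / norm(v, 2))
print(worst_small, worst_large)
```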

Acknowledgment. The authors would like to thank Subhashis Ghosal for 
suggesting a simplified argument in the proof of Proposition 4.1. 

SUPPLEMENTARY MATERIAL 

Supplement to "Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences" (DOI: 10.1214/12-AOS1029SUPP; .pdf). This supplementary file contains the proofs of some technical results appearing in the paper.